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ABSTRACT 

Molecular  surveys  have  revealed  that  microbial  communities  are  extraordinarily  diverse. 
Yet,  two  important  questions  remain  unanswered:  how  many  bacterial  types  co-exist,  and 
do  such  types  form  phylogenetically  discrete  units  of  potential  ecological  relevance?  This 
thesis  explores  these  questions  by  investigating  bacterial  diversity  in  two  complex  marine 
communities  (coastal  bacterioplankton  and  sediment  sulfate-reducing  bacteria)  by  (i) 
comprehensive  analysis  of  large  16S  rRNA  clone  libraries,  and  (ii)  refinement  and 
application  of  parametric  diversity  estimators.  Identification  and  correction  of  sequence 
artifacts  demonstrated  their  potentially  significant  contribution  to  diversity  estimates. 
Still,  hundreds  of  unique  rRNA  sequences  (ribotypes)  were  detected  in  the  corrected 
libraries,  and  extrapolation  to  community  diversity  with  commonly  used  non-parametric 
diversity  estimators  suggested  at  least  thousands  of  co-existing  ribotypes  in  the  two 
communities.  However,  close  inspection  revealed  that  the  non-parametric  estimators 
likely  lead  to  underestimation  of  ribotype  diversity  in  the  clone  libraries.  Thus,  an 
improved  parametric  method  was  developed  and  shown  to  closely  fit  the  data.  The 
extrapolated  total  ribotype  diversity  in  the  sample  by  the  improved  method  was  up  to  one 
order  of  magnitude  higher  than  estimated  with  common  non-parametric  approaches. 

Most  significantly,  the  compensation  for  artifacts  and  improved  estimation  revealed  that 
the  vast  majority  of  ribotypes  fall  into  microdi verse  clusters  containing  <1%  sequence 
divergence.  It  is  proposed  that  the  observed  microdiverse  clusters  form  important  units 
of  differentiation  in  microbial  communities.  They  are  hypothesized  to  arise  by  selective 
sweeps  and  contain  high  diversity  because  competitive  mechanisms  are  too  weak  to 
purge  diversity  from  within  them. 
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INTRODUCTION 


We  are  only  beginning  to  understand  the  extent  of  microbial  diversity  and  principles 
controlling  the  distribution  and  abundance  of  microorganisms  in  natural  environments. 
Although  cultivation-independent  techniques  have  revealed  that  microbial  communities 
are  extraordinarily  diverse  (Pace,  1997;  Hugenholz  et  al.,  1998),  the  number  of  co¬ 
existing  bacterial  types  in  a  natural  microbial  community  and  whether  these  are  organized 
into  some  natural  units  of  differentiation  remain  unknown. 

Most  microbes  defy  cultivation  by  standard  methods.  Therefore,  the  only  reliable 
way  to  estimate  microbial  diversity  is  by  using  sequence  data  from  genes  obtained 
directly  from  environmental  samples.  The  rRNA  gene  has  been  adopted  as  the  most 
commonly  used  tool  to  determine  evolutionary  relationships  and  diversity  (Woese,  1987; 
Pace,  1997).  Most  commonly,  this  is  done  by  the  extraction  of  environmental  DNA, 
polymerase  chain  reaction  (PCR)  amplification  of  target  genes,  clone  library  construction 
from  amplified  gene  fragments,  and  gene  sequencing  (Head  et  al.,  1998). 

The  development  of  molecular  methods  has  provided  a  powerful  tool  for  the  study 
of  microbial  diversity.  In  1998,  GenBank  contained  only  approximately  8,500  16S  rRNA 
sequences,  a  majority  of  which  belonged  to  cultured  prokaryotes  (Rappe  and  Giovannoni, 
2003).  Within  a  period  of  six  years,  this  number  has  grown  to  over  183,000  entries.  Most 
of  the  recently  added  sequences  are  from  environmental  clones  derived  from  sediments 
and  aquatic  environments.  It  has  been  observed  that  a  high  number  of  these  are  closely 
related,  but  not  identical  to  the  sequences  already  deposited  in  GenBank,  suggesting  that 
these  closely  related  sequences  may  form  discrete  groups  (Rappe  and  Giovannoni,  2003) 

Microdiverse  sequences  have  been  observed  in  rRNA  genes  retrieved  from 
various  environments  (Field  et  al.,  1997;  Garcia-Martinez  and  Rodriguez- Valera,  2000; 
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Casamayor  et  al.,  2002).  Although  the  nucleotide  divergence  observed  in  these  cloned 
sequences  could  be  explained  by  evolution,  it  remains  unclear  how  much  of  this  variation 
stems  from  experimental  errors  and  small-scale  variation  in  sequences  among  rRNA 
operons.  Since  the  contribution  of  experimental  errors  to  diversity  is  difficult  to 
establish,  most  studies  have  clustered  sequences  into  95-99%  sequence  similarity  groups 
(Hughes  et  al.,  2001 ;  Martin,  2002;  Torsvik  et  al.,  2002).  However,  in  order  to  elucidate 
relationships  and  diversity  at  finer  scales  of  sequence  divergence,  methods  that  minimize 
and  account  for  contribution  of  sequence  artifacts  need  to  be  developed.  The 
development  of  such  methods  is  addressed  in  this  thesis. 

Microbial  diversity  estimates 

Statistical  methods  hold  promise  for  describing  the  diversity  of  a  given  environment, 
wherein  the  observed  sequence  diversity  is  extrapolated  to  that  of  a  sampled  clone  library 
or  environment  (Moyer  et  al.,  1998;  Dunbar  et  al.,  1999;  Hughes  et  al.,  2001).  Typically, 
the  diversity  of  microorganisms  is  assessed  using  sequence  data  from  genes  obtained 
directly  from  the  environment.  The  diversity  of  ribotypes  or  clusters  in  a  clone  library  is 
then  estimated  using  statistical  approaches  (Dunbar  et  al.,  1999;  Hughes  et  al.,  2001; 

Stach  et  al.,  2003).  These  methods  were  only  recently  introduced  to  microbial  ecology 
(Hughes  et  al.,  2001)  and  the  critical  evaluation  of  their  accuracy  has  remained  a 
challenge  because  molecular  surveys  have  not  produced  large  enough  data  sets. 

Most  of  the  statistical  methods  were  developed  for  estimating  macroorganismal 
diversity  and  these  fall  into  two  general  categories:  (i)  parametric  methods,  where  the 
abundance  distribution  of  taxa  is  assumed  to  have  a  specified  parametric  form,  and  (ii) 
non-parametric  methods,  where  no  abundance  distribution  model  is  assumed  (May,  1975; 
Colwell  and  Coddington,  1994;  Krebs,  1999).  In  microbial  ecology,  the  most  commonly 
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applied  richness  estimator  is  the  non-parametric  Chaol  estimator  (Chao,  1984,  1987; 
Hughes  et  al.,  2001;  Bohannan  and  Hughes,  2003).  It  has  been  noted  that  the  Chaol 
estimator  gives  biased  (low)  estimates  of  diversity,  especially  for  very  heterogeneous 
communities  (Mao  and  Lindsay,  2001;  Bohannan  and  Hughes,  2003).  Thus,  other 
approaches  such  as  parametric  analyses,  although  computationally  more  demanding, 
should  be  considered  in  microbial  diversity  analyses.  Especially  for  large  data  sets, 
parametric  methods,  rather  than  non-parametric  alternatives,  would  be  expected  to  more 
accurately  estimate  the  diversity  of  highly  diverse  communities.  The  application  of  both 
parametric  and  non-parametric  methods  to  large  microbial  data  sets  is  evaluated  in  this 
thesis. 

Goals  of  this  thesis 

This  thesis  addresses  two  important  questions:  (i)  How  many  distinct  types  of  bacteria 
co-exist  in  a  microbial  community,  and  (ii)  do  these  types  form  discrete  phylogenetic 
clusters  of  potential  ecological  significance?  These  questions  were  explored  by 
comprehensively  sampling  two  large  clone  libraries  obtained  from  two  complex  marine 
communities  (coastal  bacterioplankton  and  sediment  sulfate-reducing  bacteria)  by  (i) 
removal  of  sequence  artifacts  generated  by  library  construction  techniques  that  may 
confound  accurate  diversity  estimation,  (ii)  detailed  phylogenetic  analysis  of  large  16S 
rRNA  clone  libraries,  and  (iii)  refinement  and  application  of  parametric  diversity 
estimators. 

One  objective  of  this  thesis  was  to  explore  the  genetic  diversity  and  fine-scale 
phylogenetic  structure  of  two  complex  microbial  communities.  We  chose  to  sample 
communities  from  two  environments  displaying  extremes  in  structural  difference:  mixed 
coastal  ocean  and  highly  structured  salt-marsh  sediment  (Chapter  2  and  3).  Therefore, 


one  may  expect  that  the  underlying  community  composition  of  the  two  communities 
would  be  different.  For  example,  it  may  be  hypothesized  that  the  efficient  mixing  in  the 
pelagic  environment  may  allow  for  more  efficient  selective  sweeps  within  the 
community.  These  would  serve  to  purge  diversity  leading  to  a  more  simple  overall 
community  composition.  Since  sediment  communities  are  highly  diverse  (Ravenschlag  et 
al.,  1998;  Whitman  et  al.,  1998)  we  focused  on  investigating  one  metabolic  guild  - 
sulfate-reducing  bacteria  (SRB).  However,  for  the  coastal  ocean  sample  we  investigated 
the  entire  bacterioplankton  community. 

Large  libraries  based  on  amplified  16S  rRNA  gene  fragments  were  constructed 
from  both  communities,  and  were  corrected  for  chimeric  sequences  and  amplification 
errors  to  allow  phylogenetic  interpretation  at  all  levels  of  sequence  divergence.  The 
corrected  set  of  sequences  was  used  to  estimate  total  diversity  and  patterns  of 
relationships  by  grouping  sequences  into  similarity  clusters  (100%,  99%,  98%  etc). 

Large  clone  libraries  were  used  to  critically  evaluate  the  existing  statistical 
approaches  in  microbial  ecology  and  to  develop  new  ones.  We  observed  that  the  most 
commonly  applied  diversity  estimator,  Chaol,  significantly  underestimated  diversity  of 
the  complex  bacterioplankton  library.  This  was  established  by  evaluating  the  Chaol 
using  simulated  communities  whose  species  abundances  were  based  on  the  dataset 
presented  in  Chapter  3.  Thus,  we  modified  parametric  estimators  to  better  account  for  the 
way  microbial  communities  are  sampled.  The  modified  parametric  estimator  was  applied 
to  the  bacterioplankton  library  and  simulated  communities,  and  compared  to  the 
commonly  used  diversity  estimator  Chaol  (Chapter  4).  The  modified  parametric 
approach  may  ultimately  provide  more  reliable  estimates  of  microbial  diversity. 
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Summary 

The  biogeochemistry  of  North  Atlantic  salt  marshes 
is  characterized  by  the  interplay  between  the  marsh 
grass  Spartina  and  sulphate-reducing  bacteria  (SRB), 
which  mineralize  the  diverse  carbon  substrates  pro¬ 
vided  by  the  plants.  It  was  hypothesized  that  SRB 
populations  display  high  diversity  within  the  sedi¬ 
ment  as  a  result  of  the  rich  spatial  and  chemical 
structuring  provided  by  Spartina  roots.  A  2000- 
member  16S  rRNA  gene  library,  prepared  with  delta- 
proteobacterial  SRB-selective  primers,  was  analysed 
for  diversity  patterns  and  phylogenetic  relationships. 
Sequence  clustering  detected  348  16S  rRNA 

sequence  types  (ribotypes)  related  to  delta- 
proteo  bacteria  I  SRB,  and  it  was  estimated  that  a  total 
of  623  ribotypes  were  present  in  the  library.  Similarity 
clustering  showed  that  •  46%  of  these  sequences  fell 
into  groups  with  <1%  divergence;  thus,  microhetero¬ 
geneity  accounts  for  a  large  portion  of  the  observable 
genetic  diversity.  Phylogenetic  comparison  revealed 
that  sequences  most  frequently  recovered  were 
associated  with  the  Desulfobacteriaceae  and  Des- 
ulfobulbaceae  families.  Sequences  from  the  Des- 
ulfovibrionaceae  family  were  also  observed,  but  were 
infrequent.  Over  80%  of  the  delta-proteobacterial 
ribotypes  clustered  with  cultured  representatives  of 
Desulfosarcina ,  Desulfococcus  and  Desulfobacterium 
genera,  suggesting  that  complete  oxidizers  with  high 
substrate  versatility  dominate.  The  large-scale 
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approach  demonstrates  the  co-existence  of  numer¬ 
ous  SRB-like  sequences  and  reveals  an  unexpected 
amount  of  microdiversity. 

Introduction 

Salt  marshes  are  among  the  most  productive  environ¬ 
ments,  with  primary  production  rates  ranging  from  460  to 
3700 gem’ 2 year1  (Gallagher  etal .,  1980;  Wiegert  and 
Pomeroy,  1981).  Formation  and  persistence  of  marshes 
is  determined  by  the  growth  of  marsh  grasses  because 
their  dense  stands  can  trap  and  stabilize  sediment  in  the 
face  of  the  erosive  power  of  tides  and  waves.  Along  the 
Atlantic  coast  of  the  United  States,  marshes  are  domi¬ 
nated  by  the  smooth  cord  grasses  Spartina  altemiflora 
and  S.  patens,  which  permeate  the  sediment  with  a  com¬ 
plex  rhizome  system  and  reach  high  production  and  turn¬ 
over  rates.  Some  of  the  plant-derived  organic  matter  is 
exported  to  coastal  waters  (Teal,  1962;  Howes  and  Goe- 
hringer,  1994),  but  a  large  portion  remains  within  marsh 
sediments  where  it  is  decomposed  by  fermentation  and 
anaerobic  respiration.  Sulphate  reduction,  mediated  by 
sulphate-reducing  bacteria  (SRB),  is  typically  the  prevail¬ 
ing  carbon  mineralization  process  in  marine  anoxic  sedi¬ 
ments  and  exceeds  respiration  using  other  electron 
acceptors,  including  oxygen,  nitrate  and  metal  oxides  (Jor¬ 
gensen,  1982;  Canfield  etal.,  1993).  Although  in  some 
marsh  sediments,  iron(lll)  has  been  suggested  to  be 
important  to  respiration  (Canfield  and  Des  Marais,  1993; 
Joye  etal.,  1996;  Lowe  etal.,  2000),  sulphate  reduction 
usually  accounts  for  67-80%  of  all  respiration  processes 
(Howarth  and  Teal,  1979;  Howarth  and  Giblin,  1983; 
Howarth  and  Hobbie,  1985). 

Marsh  grasses  and  SRB  appear  to  maintain  a  complex 
relationship,  which  displays  elements  of  both  antagonism 
and  dependence.  On  the  one  hand,  sulphide,  the  end- 
product  of  sulphate  reduction,  may  have  negative  effects 
on  the  plant  because  of  its  toxicity,  and  oxygen  leaking 
into  the  sediment  from  the  aerenchyma  may  inhibit  sul¬ 
phate  reduction.  On  the  other  hand,  sulphate  reduction 
rates  can  be  tightly  correlated  to  marsh  grass  production. 
For  example,  a  fivefold  increase  in  sulphate  reduction 
rates  has  been  observed  during  the  above-ground  elon¬ 
gation  of  the  tall  form  of  S.  altemiflora  in  a  New  England 
marsh.  Between  early  June  and  August,  Spartina  roots 
grow  rapidly  and  leak  large  amounts  of  exudates,  which 
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serve  as  substrates  for  SRB  growth  (Hines  etal.,  1989; 
1999).  This  is  consistent  with  the  observation  that 
sulphate-reducing  activity  follows  seasonal  patterns  of 
vegetational  changes  (Currin  etal. ,  1995)  and  that  rRNA 
abundance  of  the  SRB  genera  Desulfonema,  Desulfococ- 
cus  and  Desulfosarcina  peaks  during  the  early  summer 
root  growth  of  Spartina  (Rooney- Varga  et  a/.,  1998;  Hines 
et  al.t  1999).  Thus,  it  is  likely  that  the  SRB  community  is 
tightly  coupled  to  marsh  grass  activity  and  that  seasonal 
and  spatial  differentiation  of  root  activity  has  a  strong 
influence  on  the  diversity  of  niche  spaces  available  to 
SRB. 

Spartina  plants  provide  a  large  variety  of  potential  car¬ 
bon  sources  to  the  SRB  community,  yet  the  extent  to 
which  these  different  compounds  are  used  by  diverse 
types  of  SRB  is  only  beginning  to  be  understood.  Roots 
directly  exude  simple  fatty  acids  and  alcohols,  such  as 
malate,  ethanol  (Mendelssohn  etal.,  1981)  and  acetate 
(Hines  etal.,  1994).  These  may  be  important  substrates 
for  SRB,  which  may  take  them  up  directly  from  the  plant 
as  suggested  by  an  increase  in  SRB  populations  associ¬ 
ated  with  the  rhizosphere  during  the  growth  season 
(Hines  etal.,  1999).  However,  the  quantitative  impor¬ 
tance  of  plant  exudates  within  the  total  sediment  commu¬ 
nity  remains  unknown,  and  acetate,  which  is  regarded  as 
one  of  the  central  metabolites  in  anaerobic  communities, 
was  found  to  support  only  about  10%  of  sulphate  reduc¬ 
tion  in  marsh  sediments  (Howarth,  1993).  This  points  to 
the  importance  of  other  plant-derived  substrates  released 
during  the  decay  of  Spartina  litter  and  includes  complex 
carbohydrates  (Opsahl  and  Benner,  1999),  phenolics  and 
humic  acids  (Wilson  etal.,  1986).  Recent  isolation  of 
SRB  capable  of  degrading  diverse  and  previously  unsus¬ 
pected  compounds  has  suggested  that  they  are  metabo¬ 
lized  by  SRB  in  the  environment.  For  example,  SRB 
capable  of  utilization  of  long-chain  fatty  acids  and  alco¬ 
hols,  glycolate  (Friedrich  and  Schink,  1995),  hydrocar¬ 
bons  (Aeckersberg  etal.,  1991)  and  aromatic 
compounds  (Beller  and  Spormann,  1997;  Phelps  etal., 
1998;  Galushko  etal.,  1999;  Harms  etal.,  1999)  have 
been  described.  High  metabolic  versatility  is  found  par¬ 
ticularly  in  the  genera  Desulfosarcina,  Desulfococcus  and 
Desulfobacterium  (Widdel  and  Bak,  1992),  and  it  is  thus 
hypothesized  that  these  play  an  important  role  in  the 
marsh. 

Among  the  major  phylogenetic  groups  of  SRB,  delta- 
proteobacterial  SRB  have  been  shown  to  be  important  in 
salt  marsh  sediments  by  both  culture-dependent  and 
independent  studies.  For  example,  using  rRNA-targeted, 
quantitative  oligonucleotide  hybridization,  Desulfovibrio, 
Desulfobacteriaceae  and  Desulfobulbus  accounted  for 
8  30%  of  Bacteria  rRNA,  and  Desutfobacteriaceae  alone 
accounted  for a  20%,  probably  making  it  the  dominant 
group  in  the  marsh  sediment  (Devereux  et  a!.,  1996).  The 


metabolic,  physiological  and  phylogenetic  diversity  of 
delta-proteobacterial  SRB  has  been  studied  extensively, 
and  it  appears  that  a  number  of  metabolic  properties  are 
confined  to  specific  phylogenetic  groups.  Indeed,  the 
traditional  classification  of  SRB  into  complete  and  incom¬ 
plete  oxidizers  has  largely  been  confirmed  by  rRNA- 
based  phytogeny.  Complete  oxidizers,  capable  of  acetate 
mineralization,  are  mainly  represented  by  the  genera 
Desulfobacter,  Desulfobactehum,  Desulfosarcina  and 
Desulfococcus,  whereas  incomplete  oxidizers,  which  oxi¬ 
dize  carbon  substrates  to  acetate,  are  mainly  represented 
by  Desulfovibrio  and  Desulfobulbus.  The  links  between 
physiology  and  phylogeny  of  the  delta-proteobacterial 
SRB  enable  some  prediction  of  community  properties, 
based  on  quantitative  16S  rRNA  hybridization  and  some 
sequence  surveys  (Devereux  and  Stahl,  1993;  Loy  etal., 
2002). 

Here,  we  investigated  the  16S  rRNA  diversity  of  the 
delta-proteobacterial  SRB  community  associated  with  the 
sediments  of  a  New  England  S.  altemillora  salt  marsh 
(Fig.  1).  We  hypothesize  that  the  plant  rhizomes  structure 
the  environment  into  numerous  microniches,  which  are 
reflected  in  high  overall  diversity  of  the  SRB  community. 
Furthermore,  we  explored  the  question  of  whether  the 
high  substrate  diversity  created  by  the  plant  is  reflected  in 
the  presence  of  phylogenetic  groups  of  metabolically  dif¬ 
ferentiated  SRB.  For  example,  it  may  be  expected  that 
complete  oxidizers  dominate  the  SRB  community, 
because  of  their  broad  substrate  spectra.  Our  approach 
is  based  on  a  large-scale  survey  of  a  16S  rRNA  gene 
library  generated  using  a  polymerase  chain  reaction 
(PCR)  primer  set  specific  for  delta-proteobacterial 
sequences.  The  library  was  constructed  from  monthly 


Fig.  1.  Location  of  the  sampling  site  (arrow)  in  the  Plum  Island  salt 
marsh,  MA.  USA. 
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samples  collected  over  an  entire  growth  cycle  of  the 
marsh  grass  S.  altemiflora  and  was  analysed  by  diversity 
estimators  and  phylogenetic  methods. 


Results 

Clone  library  analysis 

Approximately  47%  of  the  =1650  positive  16S  rRNA  gene 
sequences  obtained  from  the  2000-member  done  library 
were  associated  with  the  delta-proteobacterial  subclass. 
Non-delta-proteobacterial  sequences  related  to  epsilon- 
Proteobacteria,  Firmicutes  and  Cytophaga/Flexibacter 
were  also  amplified  because  of  the  broad  specificity  of 
primer  385F,  which  was  designed  to  cover  all  known  SRB 
groups  within  delta- Proteobacteria.  A  large  majority  of  the 
774  delta-proteobacterial  sequences  clearly  fell  within 
previously  identified  SRB  families:  Desulfobacteriaceae, 
Desulfobulbaceae  and  Desulfovibrionaceae.  The  only 
exceptions  were  eight  sequences  for  which  the  associa¬ 
tion  with  other  delta-proteobacterial  groups  Bdellovibrio, 
Syntrophobacter  and  GeobacteriPelobacter  was  ambigu¬ 
ous,  i.e.  the  sequences  were  less  than  85%  similar  to 
these  groups.  These  sequences  were  induded  in  further 
analysis  of  delta-proteobacterial  SRB  relationships 
because  of  their  close  relationship  to  the  SRB.  Further¬ 
more,  the  GeobacteriPelobacter  group  harbours  repre¬ 
sentatives  capable  of  sulphur  reduction  and  are  thus 
ecologically  related  to  SRB  (Lonergan  etal.,  1996).  A 
second  library  skewed  towards  Gram-positive  bacteria 
was  constructed;  however,  despite  sequencing  of  1000 
clones,  no  sequence  related  to  SRBs  was  detected  (data 
not  shown). 

The  delta-proteobacterial  sequences  were  subjected 
to  a  detailed  evaluation  of  Taq  errors  and  chimera  forma- 
ton.  A  large  fraction  of  the  sequences  displayed  a  very 
high  similarity  to  each  other,  a  phenomenon  potentially 
caused  by  base  misincorporation  during  amplification.  It 
was  thus  decided  to  conduct  a  detailed  estimation  of  the 
potential  contribution  of  Taq  error  to  sequence  diversity. 
This  was  done  by  fitting  nucleotide  positions  of  amplified 
sequences  to  16S  rRNA  secondary  structure  models 
(Cannone  etal.,  2002).  We  considered  that  there  was  a 
base  misincorporation  if  nucleotide  changes  occurred  in 
positions  that  (i)  are  >98%  conserved  in  the  entire  bac¬ 
terial  16S  rRNA  secondary  structure  data  set;  and  (ii) 
lead  to  non-canonical  basepairing  in  stem  structures  and 
were  absent  in  closely  related  sequences.  The  first  of 
these  rules  is  expected  to  lead  to  a  slight  overestimation 
of  Taq  error  while  the  second  is  likely  to  result  in  under¬ 
estimation.  However,  the  determined  Taq  error  rate 
agreed  remarkably  well  with  theoretically  predicted  rates 
based  on  reported  values  of  Taq  misincorporation  rates 
of  2  x  1 0"5  nucleotides  per  cycle  (Tindall  and  Kunkel, 


1988).  The  rate  based  on  the  above  rationale  was 
1.7x1  O'5  and  2.5  xIO"5  nucleotides  per  cycle  for 
regions  analysed  with  rules  (i)  and  (ii)  respectively. 
The  data  were  also  checked  for  putative  chimeras  by 
the  RDP  chimera_check  (Maidak  etal.,  1999)  and  the 
chimerabuster  algorithm,  which  was  newly  developed 
specifically  to  analyse  clone  libraries  with  high  coverage. 
This  identified  32  sequences  deemed  likely  chimeras  by 
one  of  the  chimera  identification  methods.  Based  on 
these  analyses,  an  additional,  corrected  data  set  of 
delta-proteobacterial  sequences  was  created  in  which 
putative  Taq  errors  were  corrected  and  from  which  puta¬ 
tive  chimeras  were  removed. 

Overall  features  of  the  sequences 

Both  the  original  and  the  corrected  data  set  indicated  that 
very  high  numbers  of  delta-proteobacterial  SRB 
sequences  co-existed  in  the  marsh  sediment  samples.  In 
the  corrected  data  set.  a  total  of  348  ribotypes  (groups  of 
identical  sequences)  were  identified.  This  is  *  23%  lower 
than  in  the  uncorrected  data  set  primarily  because  of  the 
removal  of  putative  Taq  errors  and  chimeras.  Rarefaction 
analysis  and  the  Chao-1  non-parametric  diversity  estima¬ 
tor  were  applied  to  both  data  sets  to  estimate  how  com¬ 
pletely  the  library  had  been  sampled  and  to  extrapolate  to 
total  sequence  diversity  (Hughes  etal.,  2001).  Rarefac¬ 
tion,  which  plots  the  number  of  clones  screened  versus 
the  number  of  ribotypes  detected,  showed  that  neither 
data  set  reached  an  asymptote,  indicating  that  the  diver¬ 
sity  in  the  clone  library  is  even  higher  (Fig.  2).  This  was 
confirmed  by  the  Chao-1  estimator,  which  computed  623 


number  of  delta-proteobacterial  clones  sampled 


Fig.  Z  Rarefaction  analysis  of  similarity  groups  within  delta-proteo¬ 
bacterial  SRB  sequences  with  different  sequence  identity  cut-off. 
Curves  were  calculated  using  the  algorithm  described  by  Hurlbert 
(1971)  and  are  plotted  as  the  number  of  identity  clusters  versus  the 
number  of  clones  Identity  clusters  were  identified  tor  uncorrected 
100%  (open  diamonds)  and  corrected  100%  (filled  diamonds)  99% 
(filled  squares),  98%  (filled  circles)  and  97%  (filled  triangles)  nucle¬ 
otide  identity;  bars  represent  standard  deviation  of  the  statistical 
resampling  process. 
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ribotypes  for  the  total  sequence  diversity  in  the  clone 
library. 

Further  analysis  suggested  that  the  initial  observation 
of  close  relationships  among  large  numbers  of  sequences 
is  preserved  even  after  correction  of  putative  Taq  errors, 
but  that  deep  phylogenetic  lineages  were  well  sampled. 
Rarefaction  analysis  of  the  corrected  data  set,  using  100% 
and  99%  sequence  identity  to  define  taxonomic  units, 
indicated  that a  46%  of  the  sequences  fell  into  clusters  in 
which  members  differed  by  <1%  nucleotide  difference 
(Fig.  2).  At  99%  identity,  only  200  sequence  groups  were 
detected,  and  Chao-1  yielded  an  estimate  of  332  groups. 
Decreasing  the  sequence  identity  cut-off  to  98%  and  97% 
produced  168  and  127  sequence  types  respectively.  The 
total  sequence  diversity  based  on  Chao-1  was  261  for  the 
98%  group  and  191  for  the  97%  identity  groups. 

Phytogeny  of  delta-proteobacterial  SRB-like  sequences 

The  delta-proteobacterial  SRB-like  community  clustered 
into  three  large  and  several  smaller  clades  based  on  the 
distance  and  parsimony  analysis  using  one  representative 
sequence  of  each  98%  similarity  group  (Fig.  3).  Over  80% 
were  associated  with  the  family  Desulfobacteriaceae 
( Desulfosardna ,  Desulfobacterium,  Desulfococcus  and 
Desulfonema)  (Fig.  3),  suggesting  that  complete  oxidizers 
with  high  substrate  versatility  dominate.  Clades  I,  II  and  V 
fell  into  the  Desulfobacteriaceae  (Fig.  3).  Clade  I, 
described  in  detail  below,  was  by  far  the  largest  containing 
a  55%  of  the  total  sequences  (424).  Clade  II.  although 
phylogenetically  diverse,  comprised  only  about  15%  of 
sequences  (113)  and  represented  the  third  largest  cluster. 
Closely  related  sequences  in  clade  II  were  environmental 
clones  recovered  from  benzene-degrading  enrichments 
(Phelps  etal.,  1998),  hydrocarbon  seeps  (AF154102)  and 
wetlands  (AY216442).  The  only  cultured  representative 
was  Desulfobacterium  anilinii  (Fig.  3).  Clade  III  repre¬ 
sented  17%  of  the  sequences  and  fell  into  the  incomplete- 
oxidizing  Desulfobulbaceae.  Cultured  relatives  were 
members  of  the  genera  Desulforhopalus,  Desulfofustis . 
Desulfocapsa  and  Desulfotalea,  but  none  closely  matched 
the  sequences  recovered  from  the  marsh  sediment 
(Fig.  3).  For  example,  one  of  the  two  largest  98%  consen¬ 
sus  groups  of  sequences  (17  members)  is  only  93%  sim¬ 
ilar  to  its  closest  cultured  relative  Desulfocapsa  sp.  La4  1 
(AF228119).  Members  of  the  Desulfovibrio  genus  are 
assigned  to  clade  IV  and  are  only  represented  by  10 
sequences.  Eight  Desulfovibno  sequences  were  99%  sim¬ 
ilar  to  Desulfovibrio  BG-6,  isolated  from  salt  marsh  sedi¬ 
ments  in  New  Hampshire  (Rooney- Varga  etal.,  1998). 
Finally,  at  least  two  small,  deep  branching  clades  (Fig.  3, 
clades  V  and  VI),  containing  10  and  19  sequences,  rep¬ 
resent  novel  lineages  with  no  published  sequence  match¬ 
ing  with  >90%  similarity. 
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Clade  I  contained  the  three  largest  monophyletic  sub¬ 
codes,  which  all  had  >96%  sequence  identity.  Two  are 
shown  in  detail  in  Fig.  4  to  illustrate  relationships  among 
closely  related  sequences.  The  largest  subclade.  I A  (125 
clones),  fell  within  the  Desulfosarcinales  and  displayed 
>96%  similarity  to  two  Desulfosardna  variabilis  strains 
and  to  a  Desulfobacterium  cetonicum  strain  (Fig.  4A). 
Some  sequences  had  <1%  nucleotide  difference  from 
clones  recovered  from  geographically  disparate  environ¬ 
ments.  Most  notably,  the  most  abundant  ribotype  with  42 
sequences  (a  6%  of  total  delta-proteobacterial  clones) 
was  only  2  bases  different  from  clone  SB  4.53  obtained 
from  Antarctic  shallow-marine  sediments  (Purdy  etal., 
2003)  (Fig.  4A).  The  second  subclade,  IB  (Fig.  4B).  con¬ 
tained  71  sequences  closely  related  to  cloned  sequences 
from  permanently  cold  marine  sediments  (Sva008l) 
(Ravenschlag  etal.,  1998)  and  an  oligochaete  endosym- 
biont  ( Olavius  algarvensis  sulphate-reducing  endosym- 
biont)  (Dubilier  etal.,  2001).  The  third  subclade  (1C;  not 
shown  in  detail)  contained  57  sequences  closely  related 
to  uncultured  SRB  (95%  similar  to  the  clone  Eel-3G12) 
from  anoxic  methane-oxidizing  consortia  recovered  from 
continental  shelf  sediments  (Orphan  etal.,  2001).  Sub¬ 
codes  IB  and  1C  had  Desulfobacterium  indolicum  and 
Desulfonema  magnum  as  the  only  distantly  related  cul¬ 
tured  relatives  (<  93%  similar)  respectively.  A  number  of 
sequences  that  fell  into  clade  I  but  were  not  associated 
with  any  of  the  large  subcodes  were  closely  related  to 
strains  isolated  on  crude  oil  extracts  and  aromatic  hydro¬ 
carbons,  in  particular  strain  NaphS2  (Galushko  etal., 
1999).  These  sequences  also  matched  clones  SB10  and 
SB29  recovered  from  a  benzene-mineralizing  consortium 
(Phelps  etal.,  1998)  (Fig.  4). 

Discussion 

Large-scale  analysis  of  delta-proteobacterial  16S  rRNA 
genes  revealed  surprising  structural  features  of  the  micro¬ 
bial  community  within  the  marsh  sediments.  Although  it 
was  hypothesized  that  the  marsh  grass  Spartina  subdi¬ 
vides  the  sediment  into  a  large  number  of  microenviron¬ 
ments,  the  extremely  high  diversity  of  co-existing  SRB-like 
sequences  was  unexpected.  Our,  to  date  unprecedented, 
sampling  effort  of  this  phylogenetically  defined  group  com¬ 
prised  1650  sequences,  of  which  a  47%  were  identified 
as  delta-proteobacterial  SRB-like  sequences.  Nonethe¬ 
less,  only  a  55%  of  the  total  delta-proteobacterial  SRB- 
like  sequence  diversity  was  captured,  comprising  623 
ribotypes  based  on  the  Chao-1  diversity  estimator.  Most 
of  this  diversity  resulted  from  almost  50%  of  the  ribotypes 
displaying  <1%  nucleotide  difference  from  each  other. 
Deeper  lineages  were  well  sampled,  and  additional 
sequencing  effort  would  not  have  yielded  a  significant 
number  of  new  types.  Comparison  with  published 
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Fig.  3.  Phylogenetic  relationships  based  on  partial  16S  rONA  sequences  of  delta-proteobacterial  clones  from  Plum  Island  marsh  sediment  Each 
sequence  represents  a  98%  identity  cluster,  and  numbers  at  terminal  nodes  represent  how  many  clones  tell  into  the  98%  cluster.  Tree  construction 
was  by  neighbour  joining  using  the  Jukes-Cantor  correction;  bootstrap  values  are  based  on  100  replicates  and  are  shown  tor  branches  with 
>50%  support. 
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sequences  suggests  that  deep  phylogenetic  relationships 
of  delta-proteobacterial  SRB  are  beginning  to  be  well 
sampled  as  no  novel  lineages  with  <90%  identity  to  known 
SRB  were  detected.  Approximately  80%  of  the  sequences 
were  associated  with  lineages  of  metabolically  versatile 
Desulfobacteriaceae.  Thus,  the  data  suggest  that  diverse 
carbon  substrates  produced  by  marsh  plants,  or  released 
during  plant  decay,  have  a  strong  effect  on  structuring  the 
community,  but  the  data  also  pose  questions  about  the 
ecological  significance  of  the  observed  microdiversity. 

The  surprising  observation  in  the  initial  data  set  that 
over  50%  of  the  sequences  fell  into  1%  consensus  groups 
was  confirmed  overall  by  detailed  evaluation  of  potential 
sources  of  error.  The  amplification  protocol  was  designed 
to  minimize  PCR  artifacts  that  may  cause  small-scale 
sequence  divergence.  The  genes  were  amplified  to  mini¬ 
mize  PCR  bias  (Polz  and  Cavanaugh,  1998)  and  errors 
(Thompson  etal.,  2002).  Chimeras,  analysed  by  two 
methods  chimera_check  and  chimerabuster,  were 
determined  to  be  relatively  insignificant.  However.  1% 
divergent  sequences  remain  nearly  impossible  to  evalu¬ 
ate.  Nonetheless,  the  32  chimeras  identified  among 
sequences  with  a  1%  similarity  cut-off  represent  such  a 
small  number  that,  even  if  one  allows  for  a  significant 
increase  in  chimera  formation  among  near-identical 
sequences,  the  effect  on  overall  diversity  estimates  would 
be  small.  The  evaluation  of  Taq  errors  suggested  that  only 
»  23%  of  the  initially  observed  diversity  resulted  from  Taq 
error.  This  indicates  that,  indeed,  a  large  number  of 
microdiverse  delta-proteobacterial  sequences  co-exist  in 
the  marsh  sediments. 

Sequence  microdiversity,  described  previously  in  clone 
libraries  (Ferris  and  Ward,  1997;  Field  etal.,  1997; 
Amann,  2000;  Garda-Martinez  and  Rodriguez-Valera, 
2000;  Casamayor  etal.,  2002;  Ferris  etal.,  2003),  has 
been  ignored  in  recent  estimates  of  microbial  diversity 
because  of  the  assumption  that,  in  addition  to  Taq  error, 
small-scale  variation  in  sequences  among  rRNA  operons 
is  responsible  for  this  pattern  (Hughes  et  ai,  2001 ;  Curtis 
etal.,  2002;  Torsvik  etal.,  2002).  Indeed,  bacteria  can 
harbour  up  to  15  rRNA  operons  (Rainey  et  al.,  1996),  and 
16S  rRNA  sequences  commonly  differ  among  operons, 
but  differences  are  typically  <1%  (Klappenbach  etal., 
2001).  Although  interoperon  variation  may  be  responsible 
for  a  considerable  fraction  of  microdiversity  in  clone  librar¬ 
ies,  several  lines  of  evidence  suggest  that  microdiversity 
is  a  feature  of  co-existing  bacterial  strains.  We  recently 


Fig.  4.  Phylogenetic  relationships  among  two  dominant  subclades. 
Identical  sequences  were  removed  from  the  analysis,  and  their  num¬ 
ber  is  shown  at  the  terminal  nodes.  Trees  were  constructed  by  neigh¬ 
bour  joining  using  the  Jukes-Cantor  correction;  bootstrap  values  are 
based  on  100  replicates  and  are  shown  for  branches  with  >50% 
support.  Subdade  IA  (A);  subclade  IB  (B). 
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conducted  a  detailed  examination  of  16S  rRNA  diver¬ 
gence  within  78  published  whole  bacterial  genomes  with 
multiple  operons  (Acinas  era/.,  2004).  These  genomes 
contained  a  total  of  397  operons  but  only  220  different 
sequences,  showing  that  a  large  portion  of  the  operons 
harbour  identical  sequences.  If  these  genomes  were 
treated  as  a  microbial  community,  16S  rRNA  cloning  and 
sequencing  would  only  result  in  roughly  threefold  overes¬ 
timation  of  strain  diversity.  However,  this  estimate  neglects 
genomes  with  single  rRNA  operons.  Taking  these  into 
account,  overestimation  is  closer  to  2.  Furthermore, 
association  of  16S  rRNA  microvariation  with  different  co¬ 
existing  cells  has  been  documented  by  in  situ  hybri¬ 
dization  (Amann,  2000).  Finally,  physiologically  distinct 
bacterial  strains  with  identical16S  rRNA  have  been 
isolated  from  the  same  environment  (Sass  etal.,  1998). 
These  considerations  suggest  that  a  portion  of  the 
observed  microdiversity  in  the  delta-proteobacterial 
sequences  may  be  caused  by  closely  related  co-existing 
strains. 

To  what  extent  the  observed  microdiversity  represents 
ecologically  differentiated  populations  is  difficult  to  ascer¬ 
tain  at  this  point.  On  the  one  hand,  such  variation  may 
represent  ecologically  undifferentiated  populations  that 
have  simply  arisen  by  accumulation  of  mutations  during 
clonal  diversification.  On  the  other  hand,  several  lines  of 
evidence  suggest  that  at  least  some  of  the  16S  rRNA 
microdiversity  represents  populations  occupying  differen¬ 
tiated  niche  spaces.  The  rRN  As  are  slowly  evolving  genes, 
and  it  has  been  hypothesized  that  protein-coding  genes 
show  evidence  of  selective  sweeps  based  on  adaptive 
mutations  before  variation  would  be  seen  at  the  16S  rRNA 
level  (Palys  etal.,  1997).  Furthermore,  isolates  with  iden¬ 
tical  16S  rRNA  need  not  be  metabolically  or  physiologically 
identical  (Fox  etal.,  1992).  This  has  been  confirmed  by 
comparison  of  genome  sequences  of  closely  related 
organisms,  which  show  quite  extensive  differences  in  gene 
arrangement,  number  and  sequence  (Aim  etal.,  1999; 
Welch  et  al.,  2002;  Ivanova  et  al.,  2003;  Read  et  al.,  2003). 
That  genomic  variation  in  strains  with  identical  16S  rRNA 
sequences  can  co-exist  in  the  same  environment  has  been 
demonstrated  by  two  environmental  genomics  studies 
(Schleper  etal.,  1997;  Beja  etal.,  2002)  and  by  isolation 
of  SRB  strains  displaying  some  physiological  differences 
from  the  same  sample  (Sass  et  al.,  1998).  Thus,  it  appears 
likely  that  microdiversity  among  salt  marsh  delta-Proteo- 
bacteria  indicates  some  level  of  ecological  adaptation  and 
shows  the  need  for  more  detailed  studies 
Despite  high  microdiversity,  the  sequences  fell  into  well- 
delineated  phylogenetic  groups.  For  some  of  these 
groups,  inference  of  likely  biogeochemical  functions  is 
possible.  Over  two-thirds  of  the  delta-proteobacterial 
clones  were  most  similar  to  representatives  of  the  genera 
Desulfosarcina  and  Desulfobacterium,  suggesting  that 


complete  oxidizers  with  high  substrate  versatility  dominate 
the  marsh  sediments.  For  example,  Desulfosarcina  vari- 
abilis,  which  was  associated  with  the  largest  single  sub¬ 
code  (Fig.  4A),  has  metabolic  capabilities  that  are  well 
matched  to  carbon  substrates  such  as  acetate,  lactate 
and  ethanol  exuded  by  Spartina  roots.  This  is  corrobo¬ 
rated  by  previous  detection  of  Desulfosarcina- like  organ¬ 
isms  by  quantitative  slot-blot  hybridization  of  rRNA 
extracted  from  Spartina  rhizosphere  (Rooney-Varga  et  al., 
1997).  Furthermore,  plants  and  decomposing  plant  litter 
release  hydrocarbons  and  aromatics,  which  can  also  be 
used  directly  by  D.  variabilis-\\ke  organisms  such  as 
strains  oXySI  and  mXySI  (Harms  etal.,  1999).  The  utili¬ 
zation  of  the  complex  substrates  such  as  plant  phenolics 
and  flavonoids  may  be  an  overall  important  property  of  the 
salt  marsh  SRB  community. 

SRB  diversity  based  on  16S  rRNA  sequences  corre¬ 
lated  with  a  recent  exploration  of  diversity  in  dissimilatory 
sulphite  reductase  (dsr)  genes  conducted  on  exactly  the 
same  sediment  samples  (M.  Bahr  etal.,  unpublished). 
Both  16S  rRNA  and  dsr  clone  libraries  were  dominated 
by  sequences  associated  with  the  family  Desulfobacteri- 
aceae.  In  addition,  in  both  libraries,  incomplete-oxidizing 
Desulfobulbus  and  Desulforhopalus  genera  were 
detected.  Moreover,  both  studies  also  failed  to  detect 
members  of  the  completely  oxidizing  but  nutritionally 
restricted  genus  Desulfobacter  (Widdel  and  Bak,  1992). 
This  suggests  that  Desulfobacter,  which  almost  exclu¬ 
sively  uses  acetate  as  an  energy  source  (Widdel  and  Bak. 
1992),  may  be  at  a  competitive  disadvantage  in  the  rhizo¬ 
sphere  where  diverse  carbon  substrates  predominate  and 
acetate  has  been  found  to  support  only  a  10%  of  SRB 
activity  (Howarth,  1993).  The  Desulfovibrio  and  Desulfof- 
rigus/Desulfotalea  groups,  which  were  at  low  abundance 
in  the  16S  rRNA  gene  library,  were  not  detected  in  the 
smaller,  dsr  library  (M.  Bahr  etal.,  unpublished). 

Other  molecular  diversity  studies  have  detected  both 
differences  and  similarities  in  SRB-like  community  com¬ 
position.  Desulfobacteriaceae  dominate  SRB  communi¬ 
ties  in  other  salt  marsh  sediments  based  on  cloning  and 
quantitative  hybridization  studies  (Devereux  etal.,  1996; 
Rooney-Varga  etal.,  1997).  Desulfosarcina  and  Des- 
ulfonema  have  been  detected  in  marine  sediments 
(Llobet-Brossa  etal.,  2002;  Purdy  etal.,  2003),  microbial 
mats  (Risatti  etal.,  1994;  Teske  etal.,  1998;  Minz  etal., 
1999)  and  hydrocarbon  seeps  (Orphan  etal.,  2001). 
Desulfobacterium- like  sequences  dominated  the  delta- 
proteobacterial  portion  of  16S  rRNA  clone  libraries  from 
the  hydrocarbon-rich  hydrothermal  sediments  of  the 
Guaymas  Basin  (Dhillon  etal.,  2003).  The  genus  Desulfo¬ 
bacterium  is  nutritionally  versatile,  as  are  the  genera  Des¬ 
ulfosarcina  and  Desulfococcus  (Widdel  and  Bak.  1992), 
and  they  may  thrive  in  habitats  with  a  similarly  diverse 
substrate  spectrum.  However,  groups  that  were  not  abun- 
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dant  in  the  Spartina  marsh  library  can  be  important  in 
other  environments.  Cultivation  surveys  often  result  in  a 
predominance  of  Desulfovibrio  strains,  for  example  in 
water  columns  (Teske  et  ai,  1996)  and  in  freshwater  lake 
sediments  (Sass  etai,  1998).  Although  cultivation  bias 
favours  quickly  growing  and  robust  Desulfovibrio  strains, 
some  molecular  surveys  also  indicate  a  significant  abun¬ 
dance  of  incompletely  oxidizing  sulphate  reducers  in 
some  environments  (Trimmer  etai,  1997;  Llobet-Brossa 
et  ai,  2002). 

This  study  represents,  to  our  knowledge,  the  most 
extensive  sampling  of  a  specific  metabolic  guild  within  a 
microbial  community  so  far.  The  large-scale  approach 
yielded  several  surprising  results,  which  pose  important 
questions  for  future  research.  The  done  library  demon¬ 
strated  the  co-existence  of  a  high  diversity  of  SRB  organ¬ 
isms  with  similar  overall  metabolism  and  revealed  high 
amounts  of  microdiversity.  One  of  the  most  important 
questions  will  be  to  determine  at  what  level  of  genetic 
differentiation  these  co-existing  organisms  are  ecologi¬ 
cally  differentiated.  This  may  be  approached  by  a  combi¬ 
nation  of  targeted  isolation  of  closely  related  organisms 
followed  by  extensive  physiological  and  population  biolog¬ 
ical  studies.  In  addition,  new  techniques,  which  allow 
simultaneous  detection  of  metabolic  activity  and  molecu¬ 
lar  identification  of  microorganisms  (Boschker  et  ai ,  1998; 
Ouvemey  and  Fuhrman,  1999;  Radajewski  etal.,  2000; 
Adamczyk  etai,  2003;  Polz  etai,  2003),  may  provide 
insights  into  niche  differentiation  among  both  closely  and 
distantly  related  organisms.  A  further  challenge  will  be  to 
determine  what  specific  environmental  factors  select  for 
the  presence  of  one  SRB  group  over  another.  For  exam¬ 
ple,  this  study  showed,  in  agreement  with  previous  inves¬ 
tigations,  a  clear  dominance  of  Desulfosarcina- like 
sequences.  It  will  be  important  to  carry  out  comparative 
environmental  studies,  perhaps  combined  with  genomics, 
to  elucidate  relevant  factors  that  govern  the  distribution  of 
microorganisms  among  different  types  of  environments. 

Experimental  procedures 

Study  site  and  sampling 

Samples  were  collected  monthly  from  March  to  October  1998 
from  the  bulk  sediment  of  a  monotypic  stand  of  the  tall  form 
(2  m)  of  the  marsh  grass  Spartina  altemiflora  at  the  mouth 
of  the  Rowley  River  in  Plum  Island  Sound  salt  marsh  (north¬ 
eastern  Massachusetts)  (Fig.  1).The  creekside  sampling  site 
had  a  continuous  and  dense  cover  of  S.  altemiflora,  and  the 
sediments  showed  no  evidence  of  macrofaunal  activity.  The 
mean  tidal  range  was  2.6  m.  and  the  site  was  flooded  during 
high  tide  although  it  stayed  -  1  m  above  water  level  during 
low  tide.  Salinity  was  measured  in  a  small  tidal  pool  near  the 
site  and  was  between  20%o  and  34%o  during  the  sampling 
period.  Triplicate  cores  (5  cm  diameter)  were  taken  within  a 
few  metres  of  each  other  at  mid-tide,  immediately  cooled  to 
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4°C  and  transported  to  the  laboratory.  The  top  4  cm  of  each 
core  was  collected  using  a  sterile  scalpel,  pooled,  placed  in 
sterile  50  ml  polypropylene  tubes  and  stored  frozen  until  fur¬ 
ther  processing. 


DNA  extraction  and  purification 

DNA  was  extracted  by  a  modified  version  of  the  bead  beating 
extraction  protocol  (Lin  and  Stahl,  1995).  One  gram  of  sam¬ 
ple  was  combined  with  0.5  g  of  sterilized  0.1-mm-diameter 
zirconium-silica  beads  (BioSpec  Products)  with  500  pi  of 
equilibrated  phenol  (pH  7.0)  and  35  pi  of  lOx  buffer  (500  mM 
sodium  acetate  and  100  mM  EDTA  buffer.  pH  7.0),  vortexed 
for  -20  s  and  homogenized  four  times  for  30  s  in  a  reciprocal 
shaker  (Mini-bead  beater;  BioSpec  Products)  with  intermit¬ 
tent  cooling  on  ice.  The  sample  was  then  incubated  for 
10  min  at  60°C,  homogenized  for  an  additional  30  s  and 
centrifuged  at  10  000  r.p.m.  at  4°C  for  10  min  to  pellet  the 
beads  and  separate  the  phases.  The  supernatant  was  trans¬ 
ferred  to  a  clean  2  ml  tube.  The  remaining  beads  were 
amended  with  100  pi  of  lx  buffer,  subjected  to  additional 
homogenization  for  30  s,  and  the  supernatant  was  collected 
after  centrifugation  for  10  min.  Both  supernatants  were  com¬ 
bined  and  recentrifuged  at  14  000  r.p.m.  at  4°C  for  10  min  to 
separate  the  remaining  phenol.  The  upper  aqueous  phase 
was  transferred  to  a  clean  tube  and  extracted  twice  with  an 
equal  volume  of  buffer-equilibrated  phenol  (pH  7.0),  followed 
by  additional  extractions  with  an  equal  volume  of  phenol- 
chloroform  and  chloroform  respectively.  Nucleic  acid  was  pre¬ 
cipitated  overnight  at  -20°C  after  the  addition  of  ammonium 
acetate  (2.5  M  final  concentration).  MgCI2  (2  mM  final  con¬ 
centration)  and  0.7  volumes  of  isopropanol.  Nucleic  acids 
were  recovered  by  10  min  centrifugation  at  14  000  r.p.m., 
followed  by  washing  twice  with  1  ml  of  80%  ethanol  and 
resuspension  in  100  pi  of  milliQ  water.  RNA  was  removed 
from  a  subsample  of  30  pi  by  incubation  at  37°C  for  30  min 
with  20  U  of  RNase  I  (NE  BioLabs).  Final  purification  was 
performed  using  a  Qiagen  spin  column  PCR  purification  kit 
according  to  the  manufacturer's  instructions. 


16S  rRNA  gene  amplification,  doning  and  sequencing 

The  delta-proteobacterial  SRB  species-specific  primer  385F 
was  developed  by  combination  of  previously  published  SRB- 
specific  oligonucleotide  hybridization  probes  (Amann,  1995; 
Rabus  etai,  1996),  and  was  used  in  combination  with  the 
bacterial  primer  1492R  for  PCR  amplification  of  16S  rRNA 
(Table  1).  Each  of  the  six  monthly  samples  was  amplified  in 
10  replicate  reactions  to  minimize  stochastic  PCR  bias  (Polz 
and  Cavanaugh,  1998).  Each  20  pi  reaction  contained 
0.2  mM  each  dNTP,  2  mM  MgCI2l  0.1  pM  each  primer,  1  pi 
of  template  DNA  (5-10  ng),  lx  PCR  buffer  and  0.1  U  of  Taq 
polymerase  (Invitrogen)  and  was  earned  out  in  a  Robocycler 
(Stratagene)  using  the  following  conditions:  initial  denatur- 
ation  at  94°C  for  3  min,  followed  by  15  cycles  of  denaturation 
at  94°C  for  1  min,  primer  annealing  at  50°C  for  1  min,  elon¬ 
gation  at  72°C  for  2  min  with  a  final  extension  step  at  72°C 
for  5  min.  The  amplification  was  carried  out  for  only  15  cycles 
to  decrease  PCR  bias  (Polz  and  Cavanaugh,  1998)  and  the 
formation  of  Taq  error  and  chimeric  sequences  (Qiu  etal.. 
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Table  1.  16S  rflNA  gene-targeted  primers. 


Primer* 

Used  in: 

Sequence 

Specificity 

Reference 

385P 

907R 

1492RC 

PCR  amplification 
Sequencing 

PCR  amplification 

CTG  ACG  CAG  CRA  CGC  CG 

CCG  TCA  ATT  CMT  TTR  AGT  TT 

TAC  GGY  TAC  CTT  GTT  AYG  ACT  T 

Most  delta-proteobacterial  SRB 
Most  Bacteria 

Most  Bacteria  and  Archaea 

Amann  (1995);  Rabus  etal.  (1996) 
Lane  (1991) 

Lane  (1991);  Vergin  etal.  (1998) 

a.  16S  rRNA  positions;  E.  cob  numbering. 

b.  Designed  from  the  commonly  used  385  and  385b  SRB  probes  by  combining  all  degeneracies. 

c.  Modified  according  to  information  provided  by  Vergin  etal.  (1998)  by  incorporating  a  degeneracy  (T/C)  at  position  1508  (E  coli  numbering). 


2001).  The  replicates  of  PCR  amplifications  were  combined, 
precipitated  using  a  QIAquick  PCR  purification  kit  (Qiagen), 
resuspended  in  40  pi  of  milliO  water  and  purified  additionally 
using  the  QIAquick  gel  extraction  kit  (Qiagen).  The  combined 
products  were  reamplified  with  five  additional  PCR  cycles  to 
minimize  the  formation  of  heteroduplex  molecules  (Thomp¬ 
son  etal.,  2002)  and  purified  using  a  QIAquick  gel  extraction 
kit.  Subsequently,  all  6  month  samples  were  combined  in 
equal  amounts  of  DNA  and  used  for  cloning. 

Four  microlitres  of  the  combined  PCR  products  (final  con¬ 
centration  9.5  pg  mr1)  were  ligated  into  PCR  2.1-TOPO  vec¬ 
tor  and  transformed  into  One  Shot  TOP  10  chemically 
competent  Escherichia  coli  cells  (Invitrogen).  Cells  contain¬ 
ing  plasmid  inserts  were  selected  by  growing  on  LB  agar 
plates  (Difco)  in  the  presence  of  ampicillin  and  Xgal  accord¬ 
ing  to  the  manufacturer's  specifications  (Invitrogen).  White 
colonies  were  transferred  to  96-well  deep  blocks  containing 
in  each  well  1 .2  ml  of  Super  Broth  (32  g  of  tryptone,  20  g  of 
Bacto  yeast  extract,  5g  of  NaCI  per  litre)  and  ampicillin 
(50  mg  r1).  After  overnight  growth  at  37°C  with  shaking  at 
250  r.p.m.,  cells  were  harvested  by  centrifugation  at 
2800  r.p.m.  for  8  min  at  4°C,  and  plasmids  were  extracted 
using  the  RevPrep  Orbit™  workstation.  Purified  plasmids 
served  as  templates  for  partial  16S  rRNA  sequence  determi¬ 
nation  using  the  bacterial  primer  907R  (Table  1)  and  the 
BigDye  Termination  kit  version  3.0  (Applied  Biosystems). 
Completed  reactions  were  run  on  a  96-capillary  3730x1  DNA 
analyser  (Applied  Biosystems). 


Sequence  analysis 

The  sequencher  software  package  (Gene  Codes)  was  used 
to  remove  vector  and  primer  sequence  and  to  check  each 
sequence  visually  for  ambiguities  not  scored  by  the  auto¬ 
mated  sequence  analysis  program.  Subsequently, 
sequences  affiliated  with  the  delta-proteobacterial  subclass 
were  identified  by  blastn  (Altschul  etal.,  1990)  and  prelimi¬ 
nary  phylogenetic  tree  construction  using  the  neighbour¬ 
joining  method  within  the  ARB  sequence  analysis  package 
(Ludwig  etal.,  2004).  These  putative  delta-proteobacterial 
sequences  were  kept  for  further  analysis  and  subjected  to  a 
robust  screening  to  score  potential  PCR-induced  errors.  First, 
putative  Taq  errors  were  identified  by  mapping  each 
sequence  manually  to  a  secondary  structure  model  of  Des- 
ulfovibrio  desulphuricans  16S  rRNA  (Cannone  etal.,  2002). 
A  sequence  position  was  scored  as  a  Taq  error  if  (i)  the 
nucleotide  differed  from  universally  conserved  positions  in 
the  >98%  consensus  sequence  assembled  for  all  bacterial 
16S  rRNAs  (Cannone  etal.,  2002)  or  (ii)  a  non-canonical 


basepairing  occurred  in  a  stem  region  of  the  secondary  struc¬ 
ture  and  was  absent  in  other  closely  related  sequences. 
Secondly,  sequences  were  tested  for  indication  of  chimera 
formation  during  the  amplification.  Initially,  chimera_check 
implemented  in  the  RDP  (Maidak  etal.,  2001)  was  used. 
However,  for  a  large  number  of  the  sequences,  it  was  difficult 
to  conclude  with  high  probability  that  they  had  originated  from 
distinct  parental  sequences  because  no  sufficiently  close 
relatives  were  present  in  the  RDR  Thus,  the  chimerabuster 
analysis  tool  (http7/web.mit.edu/polz/seqtools/chimera.html) 
was  developed.  Briefly,  the  rationale  for  chimerabuster  is 
derived  from  the  fact  that  chimeras  are  combinations  of 
sequences  present  in  the  sample  and  that  well-sampled 
clone  libraries  should  have  a  high  incidence  of  co-occurrence 
of  chimeras  and  their  parental  sequences.  The  program  chi¬ 
merabuster  uses  the  two  highly  variable  regions  at  each  end 
of  the  molecule  as  in  silico  probes  with  adjustable  specificity 
cut-offs.  The  program  flags  all  sequences  in  which  each 
probe  matches  two  or  more  distinct  sequences  with  >1% 
sequence  difference.  Thus,  three  sequences  are  identified, 
of  which  two  are  parental  and  one  the  potential  chimera.  Of 
these  three,  the  sequence  with  the  lowest  incidence  in  the 
clone  library  was  identified  as  more  likely  to  be  chimeric 
because  chimeras  form  at  later  stages  in  the  amplification 
when  the  parental  sequences  are  already  abundant.  These 
putative  chimeras,  in  addition  to  those  identified  by 
Chimera.check,  were  excluded  from  the  data  set  for  phylo¬ 
genetic  analysis. 

To  determine  how  well  the  clone  library  was  sampled  at 
different  sequence  similarity  levels,  the  sequences  were  first 
grouped  into  100%,  99%,  98%  and  97%  similarity  groups  and 
rarefied.  A  clustering  tod,  which  uses  the  nearest  neighbour 
approach,  adds  a  sequence  to  a  cluster  if  there  is  at  least 
one  sequence  that  is  within  the  similarity  threshold  set  for  the 
clustering  (http'7/web.mit.edu/polz/seqtools/clusters.html). 
Rarefaction  was  carried  through  the  Rarefaction  Calculator 
(http://www2.biology.ualberta.ca/jbrzusto/rarefact.php).  To 
estimate  the  total  number  of  similarity  clusters  in  the  clone 
library  at  the  different  cut-offs,  the  Chao-1  non-parametric 
species  richness  estimator  was  calculated  (Chao,  1987; 
Hughes  etal.,  2001). 

Phylogenetic  analyses  were  carried  out  in  paup#,  version 
4.0b10  (Swofford,  1993).  For  determination  of  the  relation¬ 
ship  of  deeply  divergent  groups,  a  data  set  containing  a 
single  representative  from  each  98%  identity  cluster  was 
assembled.  Relationships  were  determined  using  the 
neighbour-joining  method  with  Jukes-Cantor  correction  and 
checked  for  consistency  using  parsimony.  The  most  variable 
regions  (E  coli  positions  452-463  and  849-850)  were 
excluded  from  phylogenetic  analyses  of  single  representative 
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sequences  of  each  98%  identity  cluster.  For  the  analyses  of 
sequences  within  1 00%  identity  clusters,  no  length  variation 
was  observed,  and  all  sequence  positions  were  included.  For 
each  analysis,  the  robustness  was  tested  by  bootstrap 
resampling  with  the  minimum  evolution  method  with  100 
replicates. 


Nucleotide  sequence  data 

The  sequences  of  the  cloned  1 6S  rRNA  SRB-like  genes  were 
deposited  in  GenBank  under  accession  numbers  AY374653- 
AY374982. 
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Erratum  Chapter  2 


Page  23  “Finally,  at  least  two  small,  deep  branching  clades  (Fig.  3,  clades  V  and  VI), 
containing  10  and  19  sequences, ...”  should  read  “Finally,  at  least  two  small,  deep 
branching  clades  (Fig.  3,  clades  V  and  VI),  containing  20  and  10  sequences, ...” 

Page  24,  Fig.  3:  “ Desulfovibrio  mediterraneus ”  exported  from  ARB  package  where  it 
was  incorrectly  referenced.  It  should  read  “ Desulfobulbus  mediterraneus .” 
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Although  molecular  data  have  revealed  the  vast  scope  of 
microbial  diversity1,  two  fundamental  questions  remain  unan¬ 
swered  even  for  well-defined  natural  microbial  communities: 
how  many  bacterial  types  co-exist,  and  are  such  types  naturally 
organized  into  phylogcnetically  discrete  units  of  potential  eco¬ 
logical  significance?  It  has  been  argued  that  without  such  infor¬ 
mation,  the  environmental  function,  population  biology  and 
biogeography  of  microorganisms  cannot  be  rigorously  explored^. 
Here  we  address  these  questions  by  comprehensive  sampling  of 
two  large  16S  ribosomal  RNA  clone  libraries  from  a  coastal 
bacterioplankton  community.  We  show  that  compensation  for 
artefacts  generated  by  common  library  construction  techniques 
reveals  fine-scale  patterns  of  community  composition.  At  least 
516  ribotypes  (unique  rRNA  sequences)  were  detected  in  the 
sample  and,  by  statistical  extrapolation,  at  least  1,633  co-existing 


ribotypes  in  the  sampled  population.  More  than  50%  of  the 
ribotypes  fall  into  discrete  clusters  containing  less  than  1% 
sequence  divergence.  This  pattern  cannot  be  accounted  for  by 
interoperon  variation,  indicating  a  large  predominance  of  closely 
related  taxa  in  this  community.  We  propose  that  such  micro- 
diverse  clusters  arise  by  selective  sweeps  and  persist  because 
competitive  mechanisms  arc  too  weak  to  purge  diversity  from 
within  them. 

Traditional  species  concepts  have  largely  been  concessions  to  the 
need  to  identify  bacteria  rcproducibly,  but  none  adequately  describe 
natural  units  of  microbial  diversity3.  It  has  recently  been  proposed 
that  natural  taxa  are  distinct  groups  of  strains  that  arise  by  periodic 
selection — a  process  of  continuing,  selectionally  neutral,  diversifica¬ 
tion  punctuated  by  adaptive  mutations  leading  to  selective  sweeps4. 
The  latter  events  purge  all  sequence  variants  except  those  associated 
with  the  genome  carrying  the  adaptive  mutation4.  One  of  the 
attractive  features  of  this  concept  is  that  it  should  be  applicable  to 
molecular  surveys  of  microbial  diversity  because  taxa  would  be 
identifiable  in  phylogenetic  trees  as  distinct  clusters  of  closely 
related  sequences1.  Moreover,  such  clusters  should  be  detectable 
independently  of  the  gene  used  to  construct  these  trees  as  long  as  the 
accumulation  of  variation  is  commensurate  with  the  occurrence  ot 
sweeps*.  However,  this  theory  has  not  been  applied  to  broad-scale 
studies  of  bacterial  diversity  in  the  environment.  Over  the  past  20 
years,  diversity  studies  have  primarily  been  based  on  analyses  of  16S 
rRNA  clone  libraries  but  it  has  remained  uncertain  to  what  extent 
fine-scale  patterns  of  variation  are  due  to  sequence  artefacts,  to 
heterogeneity  among  paralogous  operons  or  to  the  co -existence  of 
similar  but  differentiated  taxa1.  Furthermore,  it  has  not  been 
explored  whether  naturally  defined  units  of  differentiation  emerge 
from  recently  released  shotgun  sequence  data  from  the  Sargasso 
Sea6. 

We  deduced  that  the  discovery  of  ecologically  significant  patterns 
of  relationships  between  co-existing  ribotypes  requires,  first,  an 
examination  of  clone  libraries  large  enough  to  elucidate  relation¬ 
ships  at  all  levels  of  diffeientiation,  and  second,  methods  that 
minimize  and  account  for  the  contribution  of  sequence  artefacts 
and  paralogous  variation  to  diversity  estimates.  We  sequenced 
about  1,000  clones  from  each  of  two  polymerase  chain  reaction 
(PCR) -derived  16S  rRNA  libraries  constructed  from  the  same 
coastal  bacterioplankton  sample.  The  first  library  employed  com¬ 
mon  (standard)  amplification  protocols.  For  the  second,  a  modified 
protocol  was  designed  to  minimize  artefacts  and  to  identify  Taq 
errors  and  chimaeric  molecules  through  extensive  sequence  analyses 
(see  Methods).  This  approach  allowed  the  most  comprehensive 
analysis  of  any  single  gene  from  co-occurring  populations  so  far, 
even  in  view  of  the  recently  released  Sargasso  Sea  study,  which  in 
aggregate  sampled  a  similar  number  of  rRNA  genes  but  from  several 
locations,  dates  and  diverse  biogeochemical  conditions6.  Our  over-  j 
all  rationale  was  to  achieve  high  coverage  of  rRNA  genes  from  a 
single  community  while  estimating  and  compensating  for  the 
influence  of  artefacts  on  ribotype  diversity,  potentially  revealing  , 
emergent  patterns. 

Comparison  of  the  two  libraries  showed  that  changes  to  the  , 
amplification  protocol  alone  decreased  the  incidence  of  unique  j 
sequences  from  76%  (692  of  909)  in  the  standard  to  61%  (686  of 
1,131)  in  the  modified  library.  Correction  for  chimacras  and  Taq 
error  lowered  the  percentage  to  48%  (516  of  1,067)  unique  j 
sequences  (Fig.  la),  demonstrating  a  potentially  significant  contri¬ 
bution  of  PCR-induced  artefacts  to  (micro)Jivcrsity  estimates. 
Consequently,  these  corrections  yield  a  significantly  lower  estimate 
of  total  ribotype  diversity  for  the  sampled  community  when 
compared  with  the  unmodified  standard  library  (  1,633  versus 
3,881)  with  the  use  of  the  Chao-1  estimator7.  A  novel  estimator 
(N  ,/Nimx)  (ref.  2)  yielded  a  similar  value  of  2,236  sequences  for  the 
corrected  data  set.  This  good  agreement,  combined  with  the  low 
incidence  of  chimacras  and  the  observation  that  corrections  account 
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for  most  expected  Taq  errors  (see  Methods),  provides  confidence  in 
the  corrected  estimate  of  ribotype  diversity. 

A  vast  and  previously  unrecognized  predominance  of  micro- 
diverse  ribotypes  was  revealed  by  further  analysis  of  relationships. 
More  than  half  of  the  observed  sequences  in  the  modified  library  fell 
into  dusters  sharing  at  least  99%  sequence  consensus  (Fig.  la).  This 
result  is  still  more  marked  when  the  Chao- 1  diversity  estimator  is 
applied  to  the  data,  indicating  that  more  than  two-thirds  of 
ribotypes  might  be  members  of  99%  sequence  clusters  in  the 
sampled  bacterioplankton.  Defining  such  99%  clusters,  rather 
than  unique  ribotypes,  as  operational  taxonomic  units  (OTUs) 
decreases  diversity  estimates  from  1,633  to  520  OTUs,  a  decline  of 
about  70%.  However,  further  clustering  into  98%  and  97%  con¬ 
sensus  groups  decreases  the  number  of  OTUs  by  only  3%  (507 
OTUs)  and  11%  (450  OTUs),  respectively  (Fig.  la).  In  fact,  a 
remarkably  consistent  exponential  decline  was  observed  in  the 
number  of  OTUs  as  cluster  cut-off  values  were  decreased  from 
99%  to  75%  (Fig.  lb).  In  stark  contrast,  the  number  of  OTUs  greatly 
exceeds  this  exponential  trend  for  values  above  99%  (Fig.  lb).  An 
essentially  identical  relationship  emerged  from  a  phylogenetic 
(maximum-likelihood)  analysis,  in  which  the  accumulation  of 
lineages  per  arbitrary  time  unit  was  inferred  under  a  molecular 
dock  model*  (data  not  shown).  This  exponential  accumulation  of 
dusters  or  lineages  is  expected  if  the  creation  and  removal  of  taxa  are 
on  average  constant  over  time*.  The  sharp  discontinuity  observed 
above  99%  similarity  therefore  suggests  increased  diversification  or 
decreased  removal  of  diversity  within  microdiverse  clusters. 

The  overall  predominance  of  extremely  dosely  related  ribotypes 
also  emerges  from  phylogenetic  analyses  as  large  clusters  of  closely 
related  taxa  (Fig.  2,  and  Supplementary  Information).  These  are 
typically  well  separated  from  other  clusters,  as  indicated  by  a 
comparison  of  average  within-cluster  and  between-cluster  sequence 
divergence  (data  not  shown)5.  The  most  sequence-rich  micro- 


Rgura  1  Compositional  pattern  of  the  coastal  bacterioplankton  sample  a.  Rarefaction 
curves  of  the  number  of  OTUs  in  a  16S  rRfsiA  library  constructed  with  standard  (crosses, 
100%  sequence  similarity  cluster)  and  modified  (diamonds.  100%  sequence  similarity 
dusters,  squares,  99%,  triangles.  98%.  circles.  97%)  amplification  protocols  Standard 
deviations  fall  within  the  symbols  and  are  not  shown  b  Number  of  OTUs  plotted  against 
changng  degrees  of  cutoffs  in  0.5%  increments  for  grouping  of  sequences  into  similarity 
dusters 


diverse  clusters  are  formed  within  the  Pclagibacter  (SARl  1)  group 
of  the  alpha -Proteobactcria  (Fig.  2)  but  all  highly  represented 
lineages  contain  such  clusters,  including  the  gamma-Proteobacteria 
and  the  Bacteriodctes  group  (Supplementary  Information).  Never¬ 
theless,  microdiverse  clusters  are  not  uniformly  distributed  between 
lineages  in  the  modified  clone  library.  For  example,  the  Cytophaga 
group  contains  more  deeply  divergent  lineages  and  fewer  micro- 
diverse  clusters  than  the  Pclagibacter  group  (Supplementary  Infor¬ 
mation).  However,  such  differences  might  be  due  to  incomplete 
sampling  because  rarefaction  (Fig.  la)  suggests  that  deeper  branch¬ 
ing  lineages  in  this  library  are  well  sampled  and  that  additional 
sequencing  should  therefore  primarily  reveal  microdiverse 
ribotypes. 

To  what  extent  can  the  observed  ribotype  microdiversity  be 
explained  by  variation  among  paralogous  operons  within  single 
genomes'?  We  have  recently  explored  this  question  by  an  analysis  of 
97  available  complete  bacterial  genomes’.  These  contain,  because  of 
multiple  non-identical  operons,  a  total  of  242  different  16S  rRNA 
sequences’;  that  is,  the  number  of  ribotypes  exceeds  the  number  of 
genomes  about  2.5-fold.  Remarkably,  intcroperon  sequence  differ¬ 
ence  remains  within  about  1%  among  these  genomes.  Only  five 
genomes  deviate  from  this  rule,  four  of  which  were  thermophilic 
bacteria  all  with  a  single  operon  with  higher  sequence  divergence’. 
Therefore,  if  one  accepts  that  the  distribution  of  operons  among 
free-living  bacteria  is  similar  to  that  of  the  97  sequenced  genomes,  a 
conservative  correction  factor  of  about  2.5  (ref.  9)  can  be  applied  to 
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Figure  2  Phylogenetic  distance  relationships  between  the  coastal  bacterioplankton  based 
on  partial  16S  rRNA  sequencing  a.  Summary  of  groups  represented  in  the  sample,  in 
which  each  number  denotes  a  phylogenetic  cluster  ot  sequences  (for  identification  key 
see  Supplementary  Information)  b.  Relationships  between  PeiagiOacler  { SAR1 1 )  dusters 
represented  by  one  sequence  of  each  99%  similarity  cluster  Numbers  assexiated  with 
nodes  represent  bootstrap  values  c.  Examples  of  microdiverse  relationships  between 
SARI  1  ribotypes  Scale  bars,  0  i  (a).  0  05  (b)  and  0  01  (c)  substitutions  per  site  Arrows 
connecting  trees  point  to  expanded  nodes 


36 


letters  to  nature 


the  estimated  number  of  sequences  (1,113)  in  the  99%  similarity 
group  to  yield  a  revised  estimate  of  at  least  446  closely  related 
genomes  co-existing  in  the  sample.  However,  this  is  probably  an 
overcorrection  because  opportunistic  bacteria  with  multiple  oper- 
ons  are  thought  to  predominate  in  culture  collections  and  among 
sequenced  genomes9,10.  Moreover,  Pclngibacter  ubique  HTCC1062, 
which  is  identical  in  sequence  to  clones  within  the  largest  SAR11 
99%  similarity  clusters,  seems  to  contain  a  single  rRNA  operon 
(S.  J.  Giovannoni,  personal  communication).  Given  that  operon 
numbers  vary  little  between  closely  related  bacteria9  it  is  unlikely 
that  the  observed  SARI  1  microdiversity  can  be  explained  by  operon 
differences.  Finally,  shotgun  sequencing  of  Sargasso  Sea  prokaryotes 
revealed  a  total  of  about  1,400  rRNA  and  about  600  RecA 
sequences6.  The  latter  is  a  single-copy  gene  in  all  currently  published 
genomes.  Their  frequency  in  the  sample  therefore  provides  an 
independent  and  almost  identical  estimate  of  2.3  rRNA  operons 
per  genome.  Thus,  we  conclude  that,  even  after  conservative 
correction,  genomes  denoted  by  microdiverse  ribotypes  represent 
by  far  the  dominant  fraction  of  bacterial  diversity  in  this  coastal 
bacterioplankton  sample. 

The  observed  pattern  raises  the  question:  what  level  of  similarity 
should  be  expected  between  genomes  carrying  microvariant  ribo¬ 
types?  Comparative  genomics  has  shown  that  genomes  can  be 
divided  into  stable  and  variable  sets  of  genes,  termed  the  core  and 
flexible/auxiliary  genome,  respectively 'M\  The  latter  arises  primar¬ 
ily  by  means  of  phage  and  transposon- mediated  lateral  gene  transfer 
and  comprises  between  1  %  and  1 8%  of  genes' 1  but  possibly  as  much 
as  60%  (ref.  13).  The  core  genome,  in  contrast,  is  a  stable  com¬ 
plement  of  genes  that  includes  rRNA  and  housekeeping  genes.  This 
core  reflects  the  overall  evolutionary  history  of  the  lineage  because 
little  lateral  gene  transfer  is  detectable9  '3  '4.  Microdiverse  ribotype 
clusters  should  therefore  also  be  apparent  in  comparisons  of  other 
housekeeping  genes,  possibly  more  so  because  of  the  higher  substi¬ 
tution  rates  typical  of  protein  coding  genes\ 

Do  microdiverse  sequences  denote  co-existing,  ecologically  dif¬ 
ferentiated  genomes?  Among  free-living  bacteria  of  very  similar 
ribotypes,  correlation  of  genomic  variation  with  ecological  par¬ 
ameters  has  been  demonstrated  convincingly  in  only  a  single  case 
involving  two  strains  of  Prochlorococcns's,  but  these  would  not  fall 
within  a  single  microdiverse  cluster  as  defined  here.  In  contrast,  no 
evidence  of  functional  differentiation  was  detected  in  several 
environmental  BAC  clones  with  microdiverse  16S  rRNA,  despite 
considerable  polymorphisms  in  protein-coding  genes'6,17.  This  is 
'  consistent  with  recently  advanced  theories  for  the  interpretation  of 
microdiverse  sequence  clusters.  It  has  been  shown5  that  clustering  of 
housekeeping  genes,  resulting  from  periodic  selection,  predicts 
ecologically  differentiated  strains  within  cultivated  bacterial  taxa. 
If  microdiverse  ribotype  clusters  in  the  environment  arise  by  the 
same  mechanism'  their  very  existence  implies  that  intracluster 
;  competition  is  too  weak  to  sweep  members  from  within  their 
;  ranks.  However,  this  does  not  require  that  these  genomes  are 
functionally  identical.  Subdifferentiation  within  the  flexible  genome 
might  provide  increased  fitness  under  episodic  or  spatially  confined 
environmental  conditions,  but  not  sufficient  growth  advantage  to 
sweep  competing  microdiverse  genomes'3.  Furthermore,  ecological 
factors  might  decrease  effective  competition.  Particularly,  predation 
has  been  suggested  to  promote  the  coexistence  of  diverse  lineages  by 
‘killing  the  winner’  of  competitive  events'".  Finally,  recombination 
might  be  important  in  delineating  and  preserving  genetic  diversity 
among  members  of  clusters  by  allowing  sweeps  of  adaptive  alleles 
without  removing  selcctionally  neutral  variation'9. 

The  above  considerations  lead  us  to  suggest  that  microdiverse 
ribotype  clusters  are  important  units  of  differentiation  in  natural 
bacterial  communities.  Indeed,  such  clusters  might  be  widespread. 
We  have  recently  detected  numerous  microdiverse  ribotypes  in  salt- 
marsh  sulphate-reducing  bacteria30,  and  ribotype  clusters  have 
previously  been  tentatively  suggested  for  some  open-ocean 


microbial  groups'.  To  determine  whether  microdiverse  ribotype 
clusters  described  here  represent  ecotypes— that  is,  ecologically 
cohesive  populations— will  require  a  detailed  comparison  of  their 
encompassed  genomic  variation.  Indeed,  high-throughput  sequen¬ 
cing6  and  cultivation*'1  provide  the  means  for  rigorous  testing  of  the 
hypothesized  ecological  importance  of  ribotype  clusters.  Most 
importantly,  such  inquiries  challenge  us  to  re-examine  concepts 
of  microbial  diversity  and  invigorate  the  search  for  ecologically  and 
evolutionarily  defined  species  concepts.  □ 

Methods 

Study  site  and  sampling 

A  2.2-litrc  water  sample  was  collected  on  6  October  2001  from  the  marine  end  of  the  Plum 
Island  Sound  estuary  (northeastern  Massachusetts),  and  bacterioplankton  was 
concentrated  on  a  0.22-i»m  filter  (Supor:  Gelman).  which  was  stored  at  -  80  *C  until  DNA 
extraction  Measured  water  parameters  were  as  follows:  temperature  16  °C.  pH  1.0. 
prokaryotic  cell  numbers  0  99  x  10*1  \  dissolved  organic  carbon  0.4  mg  Cl  \ 
chlorophyll  a  5  94  ng  I 

DNA  extraction  and  clone  library  construction 

Cells  on  filters  were  lysed  and  nucleic  acids  were  extracted  with  a  modified  version  of  a 
bead-beating  method*’  followed  by  treatment  with  RNase  I  and  purifies  lion  on  Qiagen 
DNA  purification  spin  columns  Two  I6S  rRNA  clone  libranes  were  constructed  from  the 
same  sample  to  estimate  the  total  coexisting  sequence  diversity  and  the  effect  of  PCR- 
induced  artefacts.  For  both,  the  bactena-spccific  primers  27F  and  1 492R  as  modified  in  ref. 
22  were  used.  Each  PCR  reaction  contained  genomic  DNA  equivalent  to  4.9  x  10*  cells. 
Ten  replicate  reactions  were  combined  and  gel- purified,  and  the  same  amount  of 
amplicons  were  cloned  with  the  PGR  2.1-TOPO  kit  (Invitrogen)  The  PCR  amplification 
for  the  first  (standard)  library  used  35  cycles  mimicking  commonly  used  protocols 
(typically  between  30  and  40  cycles).  The  second  (modified!  library  was  constructed  to 
minimize  the  accumulation  of  the  three  known  PCR  anefacts  (Taq  errors,  chimaeras  and 
heteroduplex  molecules)*.  In  brief,  the  sample  was  amplified  for  1 5  cycles  followed  by  a  3- 
cycie  reconditioning  step",  which  eliminated  heteroduplex  molecules'*  and  decreased  the 
incidence  of  Taq  errors  and  chimaeras.  Purified  plasmids  served  as  templates  for  partial 
I6S  rRNA  sequence  determination  with  the  bacterial  primer  27F  (ref.  20) 

Sequence  analysis 

Sequences  (position  68  to  805.  £  co/i  numbering)  from  the  corrected  library  were  further 
analysed  for  evidence  of  PCR  artefacts.  About  3%  of  the  sequences  were  removed  as 
putative  chimaeras  on  the  basis  of  identification  by  a  combination  of  the  three 
biomformatics  tools  Chimera.Check*4.  Bellerophon*'  and  ChimeraBustcr''*.  Taq  errors 
were  identified  by  manual  reconstruction  of  I6S  rRNA  secondary  structures  as  pioneered 
in  ref.  26  and  detailed  in  ref.  20.  In  brief,  the  method  scores  as  Taq  errors  sequence  changes 
that  violate  either  the  sequence  conservation  rule  (nucleotides  that  are  different  in 
positions  more  than  98%  conserved  in  all  bacterial  sequences)  or  the  secondary-structure- 
conservation  rule  (apparent  changes  resulting  in  mismatches  in  stem  structures  that  are 
not  detected  in  related  sequences)  The  Taq  error  rate  determined  from  these  rules  was 
3.3  x  10  5  per  nucleotide  per  duplication,  which  agrees  remarkably  well  with  the 
experimentally  determined  value  of  2  x  10  'per  nucleotide  per  duplication  for  the  Taq 
polymerase  used*.  Further  confidence  that  the  large  majority  of  Taq  mors  are  captured  is 
lent  by  several  simple  considerations.  First,  inspection  of  alignments  showed  the 
remaining  variation  clustered  in  regions  known  to  be  highly  variable,  which  is  inconsistent 
with  the  expected  random  distribution  of  Taq  errors.  Second,  separate  quantification  of 
Taq  error  rate  for  positions  falling  under  the  conservation  and  the  second*  ry  structure  rule 
previously  gave  highly  similar  rates  of  1.7  x  10  '  and  2.5  X  10  \  respectively'1  Third, 
after  18  cycles,  the  inferred  Taq  error  rate  would  lead  to  a  misincorporat  ion  of  bases  at  a 
rate  of  3.6  x  10  4  per  nucleotide.  Because  about  800  base  pairs  of  sequence  reads  were 
obtained,  this  would  translate  into  an  average  of  0.3  errors  per  sequence  This  is  close  to  the 
fraction  of  sequences  (0.26;  181  of  686)  removed  from  the  modified  library  owing  to 
identification  of  Taq  errors.  Last,  potentially  undetectable  Taq  errors  by  the  secondary 
structure  rule  are  those  that  change  one  allowed  base  pairing  into  another  (for  example.  A - 
l)  to  G-U).  Although  this  can  happen  in  iwo-thirds  of  all  positions  in  rRNA.  only  30%  of 
the  time  will  random  replacement  of  one  nucleotide  by  another  due  to  Taq  error  result  in 
another  allowed  base  pairing.  Therefore,  at  worst  20%  (0.67  x  0.3)  of  Taq  errors  are 
missed  hut  since  only  about  64%  of  the  nucleotide  positions  fell  under  t  he  secondary- 
structure  rule  this  number  translates  into  about  13%  Considering  the  probable  incidence 
of  emirs  per  sequence  based  on  a  Taq  error  rate  of  2  x  10  s  the  calculation  again  results  in 
an  estimated  4%  of  sequences  (0.13  X  0.3)  that  carry  errors  missed  by  the  applied 
corrections. 

The  corrected  set  of  sequences  was  used  to  estimate  total  diversity  in  the 
bacterioplankton  community  and  patterns  of  relationships  An  algonthm  was 
developed'"  to  group  sequences  into  percentage  similarity  clusters  ( I00%» .  99%.  98%.  and 
so  on)  This  formed  the  basis  for  statistical  extrapolation  of  total  .sequence  diversity  with 
the  Chao- 1  (ref.  7)  and  N  ,/NMAX  (ref.  2)  estimators.  Accumulation  of  li  neages  through 
time  was  calculated  with  GENIF."  from  a  non-optimized  tree  inferred  by  maximum 
likelihood  with  the  molecular  clock  assumption  cnlorced*  Identification  of  phylogenetic 
affiliation  of  the  sequences  was  performed  with  the  neighbour-joining  method 
implemented  in  ARB*  and  followed  by  an  analysis  of  more  restricted  groups  of  sequences 
by  using  distance  and  parsimony  method'  in  l*AL’P*  (ref  29). 
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Cambrian  origins  and  affinities  of  an 
enigmatic  fossil  group  of  arthropods 

N.  L  Vaccari  ,  G.  D.  Edgecombe  it  C.  Escudero 

'instituto  dc  Geologia )  Mineria,  Umversidad  Nacional  dc  Jujuy,  Avert  xda  Bolivia  i 
1661,  San  Salvador  dc  Jujuy  (4600),  Argentina 
2 Australian  Museum,  6  College  Street,  Sydney,  NSW  2010,  Australia 
y  O' Higgins  y  Zarate,  108,  Palpala  (4612),  Jujuy,  Argentina 

Euthycarcinoids  are  one  of  the  most  enigmatic  arthropod  groups,  ! 
having  been  assigned  to  nearly  all  major  clades  of  Arthropoda. 
Recent  work  has  endorsed  closest  relationships  with  crustaceans' 
or  a  myriapod-hexapod  assemblage2,  a  basal  position  in  the 
Euarthropoda3,  or  a  placement  in  the  Hexapoda4  or  hexapod 
stem  group5.  Euthycarcinoids  are  known  from  13  species  ranging 
in  age  from  Late  Ordovician  or  Early  Silurian  to  Middle  Triassic, 
all  in  freshwater  or  brackish  water  environments'*.  Here  we 
describe  a  euthycaxcinoid  from  marine  strata  in  Argentina  dating 
from  the  latest  Cambrian  period,  extending  the  group's  record 
back  as  much  as  50  million  years.  Despite  its  antiquity  and 
marine  occurrence,  the  Cambrian  species  demonstrates  that 
morphological  details  were  conserved  in  the  transition  to  fresh 
water.  Trackways  in  the  same  unit  as  the  euthycarcinoid 
strengthen  arguments  that  similar  traces  of  subaerial  origin 
from  Cambro-Ordovician  rocks  were  made  by  euthycarcinoids7 
Large  mandibles  in  euthycarcinoids4,9  are  confirmed  by  the 
Cambrian  species.  A  morphology-based  phylogeny  resolves 
euthycarcinoids  as  stem-group  Mandibulata,  sister  to  the  Myr- 
iapoda  and  Crustacea  plus  Hexapoda. 

Mandibulata  Snodgrass,  1938 
Euthycarcinoidea  Gall  and  Grauvogel,  1964 
Euthycarciniformes  Starobogatov,  1988 
Apankura  gen.  nov. 

Etymology.  Apankura  (Quechua),  meaning  crab. 

Type  species.  Apankura  machu  gen.  et  sp.  nov. 

Diagnosis.  Euthycarciniform  with  large  mandibles  that  occupy 
most  of  the  space  beneath  the  posterior  cephalic  tergite;  anterior 
two  pairs  of  pre-abdominal  limbs  smaller  than  the  posterior  nine  | 
pairs;  limbs  markedly  taper  distally,  composed  of  about  ten  podo-  | 
meres,  distal  podomeres  arc  shorter,  large  setae  arc  absent;  at  least 
six  post-abdominal  segments;  post-abdominal  tergites  are  each 
about  2.5-times  wider  than  they  are  long. 

Apankura  machu  sp.  nov. 

Etymology.  Genus  as  above;  machu  (Quechua),  meaning 
grandfather. 

Holotype.  Museo  de  Gcologia,  Mineralogia  y  Paleontologia,  Uni- 
versidad  Nacional  de  lujuy  (JUY-P  24;  Fig.  1 ). 

Locality  and  horizon.  Bed  of  Rio  Huasamayo,  Garganta  del  Diablo, 
near  Tilcara,  Jujuy  Province,  Argentina.  The  holotype  (the  only 
known  specimen)  is  in  greenish-grey  mudstone  from  the 
Casa  Colorada  Member,  Santa  Rosita  Formation.  The  trilobites 
Neoparabolma  frequens  argenttna  and  Plicatohna  scalpta  on  the 
same  slab  indicate  a  latest  Cambrian  age  (lower  part  of  Ncopar- 
abolina  frequens  argentina  zone)10.  Green  shales  of  the  Casa  Color¬ 
ada  Member  represent  lower  offshore  deposition  in  an  open  marine 
facies". 

Diagnosis.  As  for  genus. 

The  holotype  is  38  mm  long,  including  the  head,  pre- abdomen 
and  six  segments  of  the  post-abdomen.  The  maximum  wi  dth  of  the 
pre-abdominal  tergites  is  16  mm.  As  in  other  euthycarcinoids'  ",  the 
head  is  composed  of  a  short  anterior  tergite  and  a  longer,  wider 
posterior  tergite.  The  latter  is  trapezoidal,  with  gently  curved  lateral 
margins.  The  antenna  is  uniramous,  with  at  least  nine  short  articles. 
Large,  well-defined  spheroidal  processes"  are  at  the  lateral  margin 
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Key  to  cluster  numbers  from  Figure  2a. 

Numbers  in  parentheses  represent  ribotypes  and  sequences  associated  with  each  cluster, 
respectively.  1,  Vibrionales  (33/40);  2,  Pseudoalteromonas/Shewanella  (12/14);  3, 
Uncultured  Gammal  (UGAMMA1)  (9/9);  4,  Uncultured  Gamma2  (UGAMMA2)  (8/12); 
5,  Methylomonas/Methylomicrobium  (8/24);  6,  Uncultured  Gamma3 
(UGAMMA3)/Symbionts  (13/20);  7,  Uncultured  Gamma4  (UGAMMA4)  (20/51);  8, 
Uncultured  Gamma5  (UGAMMA5)  (13/33);  9,  Cyclocasticus/Symbionts  (20/33);  10, 
Methylophylaceae  (Beta  Proteobacteria)  (17/29);  1 1 ,  Comamonadaceae  (Beta 
Proteobacteria)  (7/7);  12,  Uncultured  Actinomycetesl  (UACTINOl)  (12/45);  13, 
Cryobacterium  (8/1 1);  14,  Uncultured  Verrucomicrobiales  (7/10);  15,  Planctomyces 
(6/7);  16,  Fusibacter  (Clostridiales)  (2/2);  17,  Bacteriovorax  (Delta  Proteobacteria)  (8/8); 
18,  Arcobacter  (Epsilon  Proteobacteria)  (8/1 1);  19,  SAR  surface  group  (Alpha 
Proteobacteria)  (43/129);  20,  SAR  deep  group  (Alpha  Proteobacteria)  (17/25);  21, 
Uncultured  Alphal  (UALPHA1)  (2/5);  22,  Roseobacter/Roseovarius  (29/87);  23, 
Uncultured  Alpha2  (UALPHA2)  (8/10);  24,  Uncultured  Alpha3  (UALPHA3)  (1/1);  25, 
Uncultured  Alpha4  (UALPHA4)  (18/28);  26,  Uncultured  Alpha5  (UALPHA5)  (3/3);  27, 
Uncultured  Alpha6  (UALPHA6)  (6/9);  28,  Unknownl  group  (UNK1)  (5/17);  29, 
Unknown2  group  UNK2  (3/4);  30,  Uncultured  CFB1  group  (UCFB1)  (10/13);  31, 
Uncultured  CFB2  group  (UCFB2)  (7/22);  32,  Uncultured  CFB3  group  (UCFB3)  (5/5); 
33,  Uncultured  CFB4  group  (UCFB4)  (4/5);  34,  Cytophagalesl  (13/55);  35,  Polaribacter 
(5/8);  36,  Uncultured  CFB5  group  (UCFB5)  (28/56);  37,  Uncultured  CFB6  group 
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16S  rRNA  gene  sequences  (ribotypes)  retrieved  from  the  Plum  Island  bacterioplankton 
sample  and  representative  reference  sequences. 
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Beta  Proteobacteria 
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Chapter  Four 

Estimating  the  diversity  of  bacterial  communities 


To  be  submitted  with  Daniele  Veneziano  and  Martin  F.  Polz  as  co-authors. 
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Abstract 


Reliable  estimates  of  diversity  are  a  prerequisite  for  many  studies  of  community 
assembly,  environmental  function,  and  biogeography.  However,  for  microorganisms, 
diversity  assessments  have  only  recently  become  possible  through  the  advancement  of 
molecular  techniques  and  application  of  statistical  methods  generally  developed  for 
estimating  diversity  of  macroorganisms.  Currently,  the  most  commonly  used  diversity 
estimator  in  microbial  ecology  is  Chao  1,  however,  theoretical  studies  have  suggested  it 
underestimates  the  diversity  of  complex  microbial  communities.  This  is  confirmed  by  our 
analysis  of  a  large  dataset  of  16S  rRNA  sequences  derived  from  a  single  microbial 
sample.  Here,  we  consider  existing  parametric  approaches,  as  these  should  perform  better 
for  very  diverse  communities.  We  modify  them  to  better  account  for  the  specific  way 
microbial  communities  are  sampled.  The  number  of  taxa  and  other  model  parameters  are 
estimated  using  maximum  likelihood  (ML).  Of  the  tested  distributions  for  abundances  of 
bacterial  types,  the  lognormal  distribution,  which  is  commonly  used  for  microbial 
communities,  results  in  the  best  fit  to  our  data.  The  number  of  taxa  estimated  using  our 
parametric  approach  and  the  lognormal  distribution  is  one  order  of  magnitude  higher  than 
the  value  given  by  the  Chaol  estimator.  Through  simulation,  we  show  that  the  difference 
is  due  to  bias  of  the  latter  estimator  and  that  the  diversity  within  a  complex  marine 
microbial  community  is  considerably  higher  than  previously  observed. 


Introduction 


Prokaryotes  are  by  far  the  most  abundant  and  diverse  organisms.  Their  global  abundance 
is  estimated  to  be  4-6  x  1030  cells  (Whitman  et  al.,  1998)  and  molecular  surveys  have 
revealed  an  astounding  microbial  diversity  that  was  previously  undetected  by  culture- 
dependent  methods  (Head  et  al.,  1998;  Hugenholz  et  al.,  1998;  Rappe  and  Giovannoni, 
2003).  However,  the  quantification  of  microbial  diversity  has  remained  a  technical 
challenge  because  molecular  surveys  have  not  produced  data  sets  large  enough  to 
critically  evaluate  statistical  methods  that  were  only  recently  introduced  to  microbial 
ecology  (Hughes  et  al.,  2001). 

As  most  bacteria  remain  unculturable,  the  most  accurate  way  to  estimate 
microbial  diversity  is  to  use  sequence  data  from  genes  obtained  directly  from 
environmental  samples.  This  involves  extraction  of  environmental  DNA,  polymerase 
chain  reaction  (PCR)  amplification  of  target  genes  (most  commonly  ribosomal  16S  rRNA 
genes),  clone  library  construction  from  amplified  gene  fragments,  and  gene  sequencing 
(Head  et  al.,  1998).  Typically,  sequences  are  clustered  into  unique  rRNA  sequences 
(ribotypes)  or  clusters  of  ribotypes  based  on  1-5%  sequence  divergence  (Martin,  2002). 
The  diversity  of  ribotypes  or  clusters  in  a  clone  library  is  then  estimated  using  statistical 
approaches  (Dunbar  et  al.,  1999;  Hughes  et  al.,  2001;  Stach  et  al.,  2003). 

The  many  existing  methods  for  estimating  organismal  diversity  fall  into  two 
general  categories:  (i)  parametric  methods,  where  the  abundance  distribution  of  taxa  is 
assumed  to  have  a  specified  parametric  form,  and  (ii)  non-parametric  methods,  where  no 
abundance  distribution  model  is  assumed  (May,  1975;  Colwell  and  Coddington,  1994; 
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Krebs,  1999).  In  microbial  ecology,  the  most  commonly  applied  richness  estimator  is  the 
non-parametric  Chaol  estimator  (Hughes  et  al.,  2001;  Bohannan  and  Hughes,  2003).  It 
calculates  diversity  of  organisms  based  on  the  number  of  taxa  represented  in  the  sample 
by  one  individual  (singletons)  and  two  individuals  (doubletons)  (Chao,  1984,  1987). 

Since  the  Chaol  estimator  depends  only  on  the  number  of  singletons  and  doubletons,  it  is 
very  simple  to  use.  However,  it  has  been  noted  that  the  Chaol  estimator  gives  biased 
(low)  estimates  of  diversity,  especially  for  very  heterogeneous  communities  (Mao  and 
Lindsay,  2001;  Bohannan  and  Hughes,  2003). 

Parametric  analysis  has  only  rarely  been  used  to  estimate  microbial  diversity.  A 
major  reason  is  that  parametric  methods  are  computationally  more  demanding  than  non- 
parametric  alternatives.  In  addition,  it  has  been  argued  that  the  need  to  assume  an 
underlying  abundance  distribution  of  taxa  in  a  clone  library  is  disadvantageous  because 
existing  data  sets  are  not  large  enough  to  support  the  choice  of  a  particular  abundance 
model  (Bohannan  and  Hughes,  2003).  However,  if  a  distribution  of  abundances  can  be 
chosen  based  on  theoretical  or  empirical  observations,  parametric  models  should  estimate 
the  diversity  more  accurately  than  non-parametric  approaches. 

In  this  paper  we  show  that  for  complex  microbial  communities  the  commonly 
used  Chaol  estimator  is  highly  biased  and  severely  underestimates  diversity.  Since  our 
data  set  is  the  largest  derived  from  a  single  environmental  sample  to  date  (sample  size  = 
1,033  16S  rRNA  sequences),  we  consider  as  an  alternative,  various  parametric 
maximum-likelihood  estimators  to  better  account  for  the  way  microbial  communities  are 
sampled.  We  compare  the  modified  parametric  estimators  with  existing  ones  and  the 
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Chaol  estimator,  and  evaluate  the  performance  of  different  estimators  by  sampling  from 
the  bacterioplankton  data  set  and  simulated  communities. 

Modification  of  a  parametric  model  and  ML  formulation 

Here  we  consider  (i)  the  formulation  of  parametric  abundance  models  and  sampling 
methods  appropriate  for  microbial  communities,  and  (ii)  estimation  of  the  model 
parameters  using  maximum  likelihood.  We  first  describe  abundance  distribution  models 
for  taxa  in  the  population  and  the  sample.  This  is  necessary  because,  the  distribution  of 
taxon  abundance  in  the  sample  depends  on  the  distribution  of  taxon  abundance  in  the 
population  and  the  sampling  procedure  used.  Second,  we  provide  a  detailed  description  of 
the  maximum  likelihood  formulation.  Lastly,  we  evaluate  how  well  different  parametric 
models  fit  the  data  and  compare  the  results  with  diversity  estimates  obtained  by  non- 
parametric  approaches. 

Distribution  of  taxon  abundances  in  a  population 

Clone  libraries  constructed  from  complex  natural  environments  are  typically  highly 
heterogeneous,  being  comprised  of  a  few  dominant  and  many  rare  taxa  (Dunbar  et  al., 
1999;  Stach  et  al.,  2003;  Acinas  et  al.,  2004a;  Venter  and  al.,  2004).  In  nature, 
abundances  vary  significantly  from  taxon  to  taxon  and  can  be  represented  by  a  random 
variable  with  a  particular  distribution  form.  Lognormal  and  gamma  distributions  have 
been  widely  used  for  highly  heterogeneous  communities  of  macroorganisms  and  are 
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reviewed  below  (Fisher  et  al.,  1943;  Preston,  1948;  Kempton  and  Taylor,  1974;  May, 
1975). 


The  lognormal  distribution  has  probability  density  function: 


-(In  r-^)a 

,  [1] 

where  ln(r)  is  normally  distributed  with  mean  value  n  and  variance  o2. 

It  has  been  argued  that  the  abundances  of  microbial  taxa  should  be  lognormally 
distributed  (Curtis  et  al.,  2002).  For  example.  May  (1975)  proposed  that  the  highly 
dynamic  and  random  growth,  such  as  that  commonly  observed  for  microorganisms  in 
natural  communities,  should  lead  to  a  lognormal  distribution  of  taxa.  Furthermore, 
multiple  biotic  and  abiotic  interactions,  such  as  nutrient  composition  and  availability, 
light,  temperature,  and  competition,  should  also  lead  to  such  a  distribution  (May,  1975). 

Another  commonly  used  abundance  model  is  the  gamma  distribution,  with 
density: 


4(r)  = 


1 


AF  (a) 


■  ra-'e 


[2] 


where  T(a)  is  the  gamma  function,  a>0  is  a  shape  parameter,  and  /3>0  is  a  scale 
parameter.  The  variable  r  in  Eq.  [2]  has  mean  value  afiR  and  variance  afil  (Engen  and 
Lande,  1996;  Diserud  and  Engen,  2000). 

There  are  two  special  cases  of  the  gamma  distribution,  which  have  been  proposed 
for  highly  heterogeneous  communities:  the  broken  stick  distribution  (Mac Arthur,  1957) 
and  the  logarithmic  distribution  (Fisher  et  al.,  1943).  When  the  shape  parameter  a 
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equals  one,  the  gamma  distribution  is  exponential  (Plotkin  and  Muller-Landau,  2002), 
which  is  a  continuous  version  of  MacArthur’s  broken  stick  model  (Cohen,  1968). 

Mac  Arthur’s  broken  stick  distribution  is  to  be  expected  when  there  is  a  fixed  amount  of 
some  governing  resource  that  ecologically  homogeneous  taxa  divide  up  among 
themselves  in  an  independent  way  (May,  1975).  When  a  -*  0,  the  gamma  distribution 
becomes  a  logseries  distribution  (Plotkin  and  Muller-Landau,  2002),  which  is  a 
continuous  version  of  Fisher’s  logarithmic  (logseries)  model  (Fisher  et  al„  1943). 

Fisher’s  logarithmic  model  is  appropriate  for  taxa  that  compete  for  a  single  resource,  as 
in  the  broken  stick  model,  but  ecologically  homogeneous  taxa  divide  up  the  resource  in 
some  dependent  fashion  (May,  1975). 

Distribution  of  taxon  abundances  in  a  sample  and  ML  parameter  estimation 
The  number  of  individuals  of  a  given  taxon  observed  in  a  sample  is  a  random  variable 
whose  distribution  depends  on  the  total  abundance  of  that  taxon  in  the  sampled 
community,  the  sample  size,  and  the  method  of  sampling.  Given  the  abundances  of  the 
taxa  in  the  community,  the  corresponding  abundances  in  the  sample  are  typically 
assumed  to  have  independent  Poisson  distributions  (Preston,  1948;  Bulmer,  1974).  While 
often  not  exact,  the  independent  Poisson  assumption  leads  to  relatively  simple  analysis 
and  gives  a  good  approximation  to  more  accurate,  but  more  complex  sampling 
distributions. 

Sampling  methods  for  macroorganisms  are  generally  different  from  those  for 
microorganisms.  There  are  two  main  methods  of  sampling:  exhaustive  or  complete  (in 
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time  or  space)  and  random  with  fixed  size  (number  of  sampled  individuals).  Plants  and 
insects  are  typically  sampled  exhaustively  inside  quadrats  and  from  traps,  respectively. 

In  this  kind  of  sampling,  the  total  sample  size  is  not  known  a  priori  as  everything  within 
a  given  space  or  time  volume  is  collected.  In  contrast,  clone  libraries  are  often  sampled 
using  a  fixed  sample  size.  As  we  shall  see,  main  difference  between  the  two  sampling 
strategies  is  in  the  mean  value  of  the  sample  abundances. 

Compound  Poisson  model  for  exhaustive  sampling 

We  first  describe  the  maximum-likelihood  formulation  for  exhaustive  sampling  and  then 
show  how  the  formulation  is  modified  when  the  sample  size  is  controlled. 

A  likelihood  method  for  exhaustive  independent  Poisson  sampling  was  recently 
described  by  Mao  and  Lindsay  (2001).  This  model  applies  when  the  space-time 
distributions  of  individuals  belonging  to  different  taxa  are  independent  Poisson  variables 
and  sampling  is  exhaustive  inside  a  certain  region.  Let  N  be  the  unknown  total  number  of 
taxa  to  be  estimated  and  denoted  by  m,  the  sample  abundance  for  taxon  i  with  possible 
values  0,  1,2, ...  The  number  of  taxa  with  sample  abundance  m=m  is  denoted  by  nm. 

oo 

Hence  nm  =  N .  In  this  case,  that  n„  n2, . . .,  are  observed  but  n0  is  not,  and  that 

m-1 

00 

^nm=n,  where  n  is  the  total  number  of  taxa  observed  in  the  sample.  In  the  exhaustive 

m-1 

sampling  model,  the  m-  s  are  taken  to  be  Poisson  random  variables  with  mean  values  Xp 
i  =  1 .  .N.  The  mean  values  are  proportional  to  the  population  abundances  and  are 
treated  as  independent  random  variables  with  the  same  probability  distributions  (usually 


57 


lognormal  or  gamma)  with  Q(  X).  Given  Q,  the  probability  f(m\Q)  that  a  generic  taxon  i 
has  sample  abundance  m,  =  m  is: 

f{m\Q)  =  ]—Xme^dQ{X).  [3] 

o  m- 

In  this  case  f(m\Q)  does  not  depend  on  the  number  of  taxa  in  the  community,  N. 
The  likelihood  function  of  (N,Q)  has  the  form: 


l(N,Q  I {nm})  «  f(n0  =  N-  n,{nm,m  > 0}  I N,Q )  = 
N\ 


(N-nyftnJ 


war-UAmw 


[4] 


m-1 


The  factor 


N\ 


(W -«)![]»-! 


in  eqn  [4]  is  a  multinomial  coefficient,  which  gives  the  number 


of  ways  in  which  sample  abundances  m„  i  =  1,.,.^/V  can  be  ordered.  The  likelihood 
function  in  eqn  [4]  can  be  factored  into  a  binomial  probability  for  the  number  of  distinct 
taxa  observed  in  the  sample  ( n )  and  a  multinomial  probability  for  the  observed  frequency 
counts  {nm,m  >  0}  given  n.  These  two  factors  are 

M I 

f(n  I  N,Q)  =  ■■■,/(  0  I  Qf-n[\  -  f(0  I  Q)]n  [5] 

n\(N-n)\ 

and 


f({nm,m>0}\N,Q,n)  = 


-  Fff  ' 


[6] 
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respectively.  Note  that,  in  the  likelihood  function,  the  factor 


in  eqn  [6]  may  be 


mm  1 


omitted. 


The  probabilities  in  eqn  [6]  do  not  depend  on  the  total  number  of  taxa  in  the 
community  N.  However,  the  distribution  Q  appears  in  both  likelihood  components,  and 
thus  inference  of  Q  must  be  based  on  the  entire  likelihood  function  (eqn  [4]). 

Compound  Poisson  model  for  fixed  sample  size 

Since  clone  libraries  contain  a  finite  number  of  individuals  and  are  sampled  without 
replacement,  for  given  population  abundances  Mh  A/2,...,  MN,  the  abundances  mh 
mN,  in  a  sample  of  fixed  size  s  have  hypergeometric  distribution.  However,  because  of  the 
large  number  of  individuals  in  each  taxon,  the  hypergeometric  distribution  may  be 
replaced  with  a  multinomial  distribution,  which  describes  sampling  with  replacement  or 
sampling  from  an  infinite  number  of  individuals.  Furthermore,  if  the  sample  size  s  is 
large,  the  multinomial  distribution  of  the  m-  s  may  be  approximated  with  an  independent 


approximation,  the  sample  size  is  not  fixed  but  is  a  random  variable  with  Poisson 
distribution  and  mean  value  equal  to  the  actual  sample  sizes.  The  coefficient  of  variation 
is  1/Vs  and  is  therefore  small  for  sample  sizes  of  the  order  of  1,000  individuals,  used 
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here  to  characterize  microbial  diversity.  If  the  M’s  are  independent  and  identically 
distributed,  the  R’s  are  also  identically  distributed,  but  are  dependent.  For  simplicity,  we 
assume  that  the  R’s  are  also  independent  with  some  distribution  Q,  inherited  from  the 
distribution  of  the  abundance  Aff. 

Under  the  above  modeling  assumptions  and  approximations,  the  probability  that  a 
taxon  is  observed  m  times  is  given  by: 


[7] 


Notice  that,  contrary  to  the  exhaustive  sampling,  f(m)  now  depends  not  only  on  Q  but 
also  on  N.  In  this  case,  the  likelihood  function  for  N  and  Q  is  given  by: 


The  likelihood  function  may  be  factored  in  analogy  with  eqns  [5]  and  [6].  To 


avoid  calculation  of  the  binomial  coefficient - : - in  eqn.  [6]  one  can  approximate 

n\(N  -n)\ 


the  binomial  probability  of  f(n\N,Q)  with  the  density  at  n  of  the  normal  distribution 
having  the  same  mean  and  variance 


E[n\N,Q]  =  N[\-f(0\N,Q)] 

Var[n  \N,Q]  =  Nf( 0  I  N,Q)[  1  -  /( 0 1  JV,<2>] 


[9] 


This  gives: 


(n-Ml-/(0W,a)l); 


•[10] 
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Eqn.  [10]  is  similar  to  the  likelihood  explained  by  Bulmer  (1974).  However, 
Bulmer’s  formulation  does  not  contain  the  first  likelihood  term  /(«  \N,Q).  Bulmer 
(1974)  stated  that  the  first  part  of  the  distribution  is  unnecessary,  since  11^,  the  number  of 
taxa  not  represented  in  the  sample,  is  unknown.  A  comparison  of  the  likelihood  functions 
and  the  associated  ML  estimates  of  N  and  a  will  be  made  later  in  this  chapter. 

Next,  we  specialize  the  present  formulation  for  the  cases  when  Q  is  a  lognormal 
or  a  gamma  distribution. 


/fml/V.O)  for  Poisson-lognormal  model 

Suppose  that  the  relative  abundances  R's  have  lognormal  distribution.  Specifically,  In/?, 


has  normal  distribution  with  variance  o 2  and  mean  ju  =  -^a2,  so  that  E[/?,]=  1 .  Then  R 


has  probability  density  function: 


q(r)  = 


ro 


VST 


-(In 


and  eqn.  [7]  becomes: 


f(m  I  N,Q)  =  J({-)V"  — U= 

m!  o  N  ro^lji 


-flu  »•■*—•  )* 

e  dr. 


[11] 


[12] 
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f(m\N.O)  for  Poisson-Gamma  model 

Suppose  now  that  the  relative  abundances  /?,’ s  have  gamma(  a,fiR )  distribution  with  unit 
mean  value  and  variance  o\.  Since  this  distribution  has  mean  value  afiR  and  variance 
afil,  it  follows  that 


a  = 


J_  =  J_ 

Pr 

Pr  =  °l 


[13] 


It  also  follows  that  the  mean  sample  abundances  A,  =  /?,  —  are  iid  with  gamma(a,/3) 
distribution  and  parameters 


a  = 


J _ 1_ 

Pr  °r 


p  =  —pR  =  —o\ 
N  R  N  R 


[14] 


In  this  case,  the  compound  Poisson-gamma  distribution  in  eqn  [7]  is  a  negative  binomial 
distribution  (Fisher  et  al.,  1943).  Specifically, 

fa  +  m-lV  p  YV  1  Y 


f(m 


where  6  =  0R  — ,  and 
R  N 


a- 1 


^  +  1/  1^  +  1/ 


[15] 


a  +  m-l\  r(a  + m)  _  a  +  m -1  a  +  m-2  a 

a- 1  I  f(a)/n!  m  m  - 1  1 
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Results 


Analysis  of  a  large  microbial  data  set  and  validation  of  the  modified  parametric  ML 
estimators 

We  apply  the  diversity  estimates  derived  above  to  the  largest  available  16S  rRNA  clone 
library  from  a  single  microbial  sample  (Acinas  et  al.,  2004b).  The  library  was  constructed 
from  a  complex  marine  bacterioplankton  community  collected  on  6  October  2001  from 
the  marine  end  of  the  Parker  River  Estuary,  MA.  The  sample  consists  of  1,033  16S 
rRNA  gene  sequences.  Among  these,  516  are  unique  rRNA  sequences  (ribotypes)  and 
approximately  50%  of  these  sequences  occurred  only  once  in  the  sample.  The  observed 
values  of  ribotype  abundances  are  given  in  Table  1.  Construction  of  the  16S  rRNA 
library,  corrections  for  sequence  artifacts,  and  a  detailed  phylogenetic  analysis  of  this 
bacterioplankton  data  set  is  described  elsewhere  (see  chapter  3). 

The  maximum  likelihood  of  the  bacterioplankton  data  set  applying  the  Poisson- 
lognormal  model  estimated  25,000  ribotypes  with  the  standard  deviation  of  the  taxa  log- 
abundances  of  2.7.  The  likelihood  for  our  Poisson-lognormal  model,  eqn  [8],  is  shown  in 
Figure  1 A  as  a  function  of  number  of  distinct  taxa  in  the  library,  N,  and  the  standard 
deviation  of  the  taxa  log-abundances,  o.  This  likelihood  is  the  product  of  two  terms, 
which  are  given  by  eqns  [6]  and  [5]  and  plotted  separately  in  Figures  IB  and  1C, 
respectively.  Figure  IB  corresponds  to  Bulmer’s  (1974)  likelihood  formulation  and 
ignores  the  fact  that  (N-n)  taxa  were  not  observed  in  the  sample.  This  likelihood 
component  constrains  well  the  total  number  of  taxa  N,  but  is  less  informative  on  (o|N). 
This  can  be  seen  from  the  separation  of  the  contour  lines  in  the  vertical  direction  (Fig. 
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IB).  The  second  component,  eqn  [5]  and  Figure  1C,  is  also  important:  it  provides  little 
additional  constraint  on  N,  but  imposes  a  clear  relation  between  a  and  N.  Thus,  both 
components  are  necessary  to  describe  maximum-likelihood  formulation  for  random 
sampling  with  fixed  number  of  sampled  individuals. 

The  bias  and  variance  of  ML  estimators  in  our  likelihood  formulation  of  N  and  o 
and  the  ML  estimators  in  Bulmer’s  formulation  were  evaluated  using  simulated 
communities  and  samples.  We  simulated  10  communities  with  N  and  a  set  to  25,000  and 
2.7,  which  are  the  ML  estimates  obtained  assuming  the  lognormal  distribution  (Fig.  1  A). 
From  each  simulated  community,  10  samples  of  1,033  individuals  were  drawn  at  random, 
as  in  the  original  bacterioplankton  data  set.  For  each  sample,  the  ML  estimates  using  eqn 
[6]  (Bulmer’s  likelihood  formulations)  and  eqn  [10]  (our  likelihood  formulation)  were 
obtained  (Fig.  ID).  The  average  value  and  the  standard  deviation  of  the  estimates  of  the 
10  populations  obtained  for  both  parametric  ML  estimators  are  shown  in  Figure  ID.  The 
average  values  of  N and  o of  100  ML  estimates  were  23,061±2.47xl08(lo2)  and 
2.56±0.041(lcr)  for  our  likelihood,  and  29,301±4.60xl08(lo2)  and  2.69±0.075(1  cr)  for 
Bulmer’s  likelihood,  respectively.  Thus,  the  variances  were  smaller  for  the  our 
likelihood.  Even  the  average  variance  of  N  and  a  of  the  10  population  variances  is  lower 
for  our  likelihood  (1.32xl08  and  0.024)  than  for  the  Bulmer’s  likelihood  formulation 
(2.81xl08  and  0.052).  Thus,  although  the  two  functions  are  comparable,  the  modified 
likelihood  results  in  lower  uncertainty  of  the  maximum  likelihood  estimate  of  N  and  o. 

The  bias  and  variance  was  also  evaluated  for  the  Chaol  estimator  using  the  same 
simulated  communities  and  samples  as  with  the  parametric  ML  estimators  (Fig  ID).  For 
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each  sample,  the  Chaol  estimates  were  obtained,  averaged  and  compared  with  the  values 
obtained  by  ML  for  the  bacterioplankton  sample  (Fig.  ID).  The  average  of  the  10  Chaol 
values  obtained  from  simulated  lognormal  communities  equals  1673  ±133(1  cr2)  ribotypes. 
Thus,  the  Chaol  values  are  more  than  5  standard  deviations  away  from  the  mean  of  the 
MLE  for  the  bacterioplankton  data  and  simulated  communities.  Overall,  this  suggests 
that  the  ribotype  abundances  in  the  library  are  well  explained  by  the  lognormal 
distribution  and  that  the  Chaol  significantly  underestimates  this  diversity. 

To  validate  the  lognormal  abundance  model,  we  compared  the  empirical  expected 
rarefaction  curve  with  the  theoretical  mean  curve  for  the  lognormal  model  and  the 
gamma  model  using  the  modified  parametric  ML  approach.  The  Poisson-lognormal 
model  (Fig.  1  A)  has  the  best  fit  to  our  data  set  for  N  equals  25,000  ribotypes  and  a  of  the 
lognormal  distribution  equals  2.7  (ML  value  equals  -561.86).  The  ±1  standard  deviation 
(a)  away  from  the  MLE  mean  ranges  from  (A= 13,500,  ln(o)=2.4)  to  (A/=49,000, 
ln(o)=2.9).  For  the  Poisson-gamma  model  (Fig.  2)  the  MLE  is  5.91x10'°  ribotypes  and 
the  ln(cr)  equals  nine  (ML  value  >-630).  The  ±lo  ranges  from  (M=1.36xl05,  ln(o)=2.9) 
to  some  unbounded  value  of  N.  Using  the  underlying  distribution  (lognormal  or  gamma) 
and  the  corresponding  MLE  value,  one  can  generate  a  rarefaction  curve  by  calculating  the 
observed  taxa  in  the  sample  given  the  sample  size.  Comparing  such  rarefaction  curves  to 
a  rarefaction  curve  of  the  data  as  well  as  the  ML  values,  we  determined  that  the 
lognormal  abundance  model  fits  the  clone  library  data  most  closely  (Fig.  3A). 

Poisson-gamma  model  fits  the  ribotype  data  more  poorly  than  the  Poisson- 
lognormal  model.  The  MLE  for  the  Poisson-gamma  model  (Fig.  2)  gives  a  very  large 
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number  of  ribotypes,  >1015,  which  exceeds  the  number  of  individuals  present  in  the 
sample  and  therefore  is  not  believable.  Interestingly,  with  essentially  equal  confidence 
(within  one  o away  from  the  mean  of  the  MLE),  N  can  have  values  betweenlO5  and  >1015 

N 

(Fig  2.).  These  estimates  lie  on  the  ridge  defined  by  the  —  =constant,  where  N  is  the 

o 

total  number  of  taxa  in  a  sampled  community  and  o2  is  the  variance  of  the  gamma 
distribution.  However,  the  Poisson-gamma  fits  the  rarefaction  curve  badly  (Fig.  3A),  and 
thus,  we  can  discount  these  results.  The  two  special  cases  of  the  Poisson-gamma 
distributions,  Fisher’s  logarithmic  (Fisher  et  al.,  1943)  and  MacArthur’s  broken  stick 
(MacArthur,  1957)  distributions,  fit  the  rarefaction  curve  even  more  poorly  (Fig.  3 A). 

A  comparison  of  the  Chaol  and  Curtis  estimators  to  the  Poisson-lognormal  model 
The  best  estimate  of  the  number  of  ribotypes  in  the  library  using  the  Poisson-lognormal 
model  (25,000  ribotypes)  is  over  one  order  of  magnitude  higher  than  the  value  of  1,633 
obtained  by  applying  the  Chaol  estimator  (Chao,  1984,  1987).  We  fixed  the  value  of  the 
total  number  of  ribotypes  to  the  Chaol  value  of  1,633  and  under  this  constraint 
determined  the  most  likely  value  of  the  associated  standard  deviation  of  the  lognormal 
distribution  (a).  Using  the  lognormal  distribution  and  the  corresponding  MLE  value,  we 
generated  a  rarefaction  curve  for  the  Chaol  by  calculating  the  observed  taxa  in  the 
sample  given  the  sample  size.  The  expected  rarefaction  curve  for  the  Chaol  for  the 
lognormal  community  is  given  in  the  Figure  3 A.  The  shape  of  the  Chaol  rarefaction 
curve  is  more  concave  down  compared  to  the  shape  of  the  ribotype  accumulation  curve 
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because  it  has  a  lower  asymptote  (Fig.  3A).  Also,  in  Figure  3A,  we  show  the  fit  to  the 
rarefaction  curve  of  another  estimator,  recently  developed  by  Curtis  et  al.  (Curtis  et  al., 
2002),  which  is  a  parametric  estimator  based  on  the  lognormal  distribution.  This 
estimator  uses  only  the  highest  observed  abundance  and  the  total  number  of  individuals 
in  a  sampled  population.  It  resulted  in  an  estimate  of  2,236  ribotypes  for  our  data  set. 

This  value  is  still  an  order  of  magnitude  lower  that  that  of  the  ML  value.  Although  the 
Curtis  estimator  uses  an  underlying  distribution  (lognormal),  it  may  suffer  from  the  same 
problem  as  the  Chaol.  Since  both  of  these  estimators  use  only  partial  information  from 
the  sample,  this  may  suggest  one  reason  for  the  unreliable  estimates. 

Analysis  of  clustered  data 

The  MLE  values  and  comparison  of  the  fits  for  the  Poisson-lognormal  and  the  Poisson- 
gamma  models  were  also  obtained  for  the  data  set  after  clustering  ribotypes  into  groups 
of  sequences  that  were  >97%  identical,  i.e.  97%  similarity  groups.  Using  the  Poisson- 
lognormal  model,  the  number  of  97%  similarity  groups  is  estimated  to  be  about  1,500 
taxa.  The  standard  deviation  of  log-abundance  distribution  is  estimated  to  be  2.7. 
Although  the  Poisson-lognormal  model  resulted  in  the  best  fit  also  for  this  data  set,  the  fit 
to  the  97%  rarefaction  curve  is  not  as  tight  as  that  to  the  100%  (ribotype)  rarefaction 
curve  (Figs.  3A  and  3B). 
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Effect  of  sample  size 

The  sample  size  of  1,033  is  larger  than  commonly  used  in  microbial  community  analysis. 
It  is  thus  of  interest  to  investigate  whether  the  ML  estimator  for  the  Poisson-lognormal 
model  performs  adequately  for  smaller  sample  sizes.  Thus,  we  randomly  chose  500 
sequences  from  the  actual  sample.  This  resulted  in  similar  maximum  likelihood 
estimates  of  around  25,000  ribotypes,  but  displayed  higher  uncertainty  due  to  smaller 
sample  size  (data  not  shown). 

We  were  further  interested  in  how  sensitive  the  Chaol  estimators  are  to  sample 
size  and  for  what  sample  sizes  the  Chaol  estimators  eventually  become  unbiased.  For 
this  purpose,  we  sampled  the  same  simulated  community,  with  sample  sizes  ranging  from 
25  to  7.5xl06  individuals.  Two  Chaol  estimators  were  applied  (Fig.  4):  the  commonly- 
used  uncorrected  (Chao,  1984)  and  the  bias  corrected  (Colwell,  1997).  Since  bias- 
corrected  Chaol  estimator  has  been  rarely  used,  we  used  the  uncorrected  Chaol  estimator 
in  all  of  our  other  analyses.  The  formulas  for  the  two  estimators  are  given  in  the  legend 
of  Figure  4.  Although  both  estimators  significantly  underestimate  diversity  for  sample 
sizes  smaller  that  105  individuals,  the  bias-corrected  Chaol  estimate  gives  higher 
estimates  than  the  uncorrected  Chaol  for  sample  sizes  smaller  than  103  individuals  (Fig. 
4).  However,  for  sample  sizes  larger  than  103,  the  uncorrected  Chaol  gives  higher 
estimates  than  the  bias-corrected  Chaol  (Fig.  4),  thus  displaying  less  bias  than  the  bias- 
corrected  Chaol.  Surprisingly,  only  for  the  largest  sample  sizes  (>106)  the  Chaol 
estimates  the  sample  size  within  the  same  order  of  magnitude.  Therefore,  especially  for 
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the  small  sample  size,  both  versions  of  the  Chaol  estimator,  uncorrected  or  bias- 
corrected,  greatly  underestimate  the  ribotype  diversity  of  diverse  clone  libraries. 


Discussion 


Estimating  diversity  is  a  basic  task  in  ecology,  yet  for  microbial  communities  it  has 
remained  elusive.  Non-parametric  estimators  (e.g.,  Chao  1)  have  started  to  be  used  to 
evaluate  microbial  diversity  because  of  reported  low  bias  (Hughes  et  al.,  2001;  Hill  et  al., 
2003;  Kemp  and  Aller,  2004).  However,  we  have  observed  from  simulated  communities 
whose  species  abundances  were  based  on  our  data  that  the  Chaol  is  usually  negatively 
biased.  Although  Mao  and  Lindsay  (2002)  have  shown  that  Chaol  is  generally  biased  for 
heterogeneous  communities,  the  extent  of  under-estimation  has  been  overlooked.  To 
obtain  more  accurate  estimates  of  microbial  diversity,  we  have  turned  to  parametric 
models.  Given  our  large  sample,  parametric  models  have  the  advantage  over  non- 
parametric  models,  that  the  underlying  distribution  of  taxa  abundances  can  be  inferred 
from  the  sample  and  this  information  can  be  used  to  more  accurately  estimate  the  total 
number  of  taxa.  Specifically,  we  have  used  a  parametric  approach  based  on  maximum 
likelihood,  which  consists  of  (i)  a  distribution  model  of  the  abundances  of  taxa  in  a 
community,  and  (ii)  a  sampling  model  appropriate  for  microbial  communities. 

We  have  concentrated  our  investigations  on  two  commonly  used  abundance 
distribution  models:  the  lognormal  (Preston,  1948)  and  the  gamma  distribution  (Fisher  et 
al.,  1943).  It  has  been  suggested  that  abundances  of  taxa  in  microbial  communities  are 
lognormally  distributed  (Dunbar  et  al.,  1999;  Curtis  et  al.,  2002).  Our  data  set  is  indeed 
best  described  by  the  Poisson-lognormal  distribution,  suggesting  that  the  lognormal 
distribution  is  the  underlying  distribution  of  the  sampled  clone  library  (Fig.  3A). 
Compared  to  the  Poisson-lognormal,  the  Poisson-gamma  distributions  have  resulted  in 
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inferior  fit  (Fig.  3A).  We  have  tested  the  gamma  model  because  of  its  suggested  good  fit 
to  heterogeneous  communities  (Kempton  and  Taylor,  1974).  Two  special  cases  of  the 
Poisson-gamma,  Fisher’s  logarithmic  (Fisher  et  al.,  1943)  and  MacArthur’s  broken  stick 
(MacArthur,  1957),  may  be  viewed  as  distributions  characteristic  of  relatively  simple 
communities  whose  dynamics  are  dominated  by  a  single  factor  (May,  1975).  Fisher’s 
logarithmic  model  has  been  criticized  because  fitting  the  log-series  distribution  presumes 
an  infinite  pool  of  species  available  for  sampling  (Kempton  and  Taylor,  1974).  It  is 
unlikely  that  complex  microbial  communities  are  governed  by  a  single  factor,  so  Fisher’s 
and  MacArthur’s  models  are  likely  inappropriate  for  modeling  abundances  of  taxa  and  in 
turn,  for  estimating  diversity  from  a  sample.  Thus,  based  on  our  results  and  theoretical 
grounds,  we  can  confidently  discount  the  gamma-based  distributions  for  complex 
microbial  communities. 

We  tested  the  sensitivity  of  the  estimates  from  the  Poisson-lognormal  diversity 
model  relative  to  (i)  sub-sampling  of  simulated  communities,  and  (ii)  sub-sampling  of  the 
actual  bacterioplankton  data.  The  results  from  the  simulated  populations  were 
comparable  to  those  observed  for  the  bacterioplankton  library  data  set,  falling  within  two 
standard  deviations  from  the  mean  of  the  maximum  likelihood  estimate  for  the  data  (Fig. 
4).  Similar  values  of  ~25,000  ribotypes  were  obtained  for  random  sub-sampling  of  500 
individuals  from  the  bacterioplankton  data  set,  but  yielded  higher  uncertainty  (data  not 
shown).  Therefore,  the  tests  of  reproducibility  suggest  that  our  model  keeps  performing 
adequately  for  the  smaller  sample  size. 
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Based  on  our  findings,  the  bias  of  the  Chaol  estimator  for  diverse  data  sets  is 
much  higher  than  previously  recorded.  The  extent  of  bias  associated  with  diverse  libraries 
was  measured  using  the  Chaol  estimator  for  samples  ranging  from  50  to  7.5  x  107 
individuals  randomly  sampled  from  simulated  communities  (Fig.  4).  Bias  by  a  factor 
smaller  than  one  order  of  magnitude  was  observed  only  for  very  large  sample  sizes  (>106) 
and  only  for  the  commonly-used  uncorrected  Chaol  estimator  (Chao,  1984,  1987). 
Furthermore,  we  showed  that  the  bias-corrected  Chaol  does  not  correct  for  the  biased 
(low)  estimates  of  the  diversity.  Although  the  Chaol  has  been  cited  as  giving  a  lower 
bound  estimate  for  heterogeneous  environments  (Mao  and  Lindsay,  200 1 ;  Bohannan  and 
Hughes,  2003)  and  a  small  sample  size  (Kemp  and  Aller,  2004),  we  find  it  5  standard 
deviations  away  from  the  mean  of  the  MLE  of  lognormal  distribution  (Fig.  ID). 
Furthermore,  the  Chaol  estimates  of  the  number  of  ribotypes  for  the  sample  size  of  1,033 
are  narrowly  distributed  around  a  mean  of  1 ,673,  with  the  largest  standard  deviation  of 
only  133  (Fig.  ID).  By  contrast,  the  ML  estimates  were  essentially  unbiased  with  a 
variance  consistent  with  the  curvature  of  the  likelihood  function.  In  actuality,  the  non- 
parametric  methods  should  give  a  higher  uncertainty  than  parametric  methods  because 
they  are  not  constrained  by  the  assumed  underlying  distribution  of  taxon  abundances. 
However,  the  variance  of  Chaol  is  surprisingly  small  and  as  such  is  a  very  misleading 
measure  of  uncertainty  (Fig.  ID). 

We  observed  that  the  Poisson-lognormal  model  results  in  a  tighter  fit  to  the  100% 
rarefaction  curve  (Fig.  1  A)  in  comparison  to  the  97%  rarefaction  curve  (Fig.  IB).  It  is 
possible  that  the  ribotype  and  cluster  data  may  contain  two  different  underlying 


72 


distributions  and  that  these  result  in  the  differential  fit  between  the  two  data  sets.  Lunn  et 
al.  (2004)  pointed  out  that  differences  in  the  underlying  abundance  distribution  of  taxa 
may  be  a  function  of  a  chosen  taxonomic  resolution,  e.g.  species  versus  genera. 

However,  closer  inspection  of  the  data  set  is  required  before  we  can  determine  whether 
the  difference  of  the  underlying  distributions  is  indeed  due  to  taxonomic  resolution. 

Although  distribution  of  ribotypes  in  the  bacterioplankton  clone  library  was  best 
described  with  the  Poisson-lognormal  model,  it  does  not  necessarily  mean  that  lognormal 
is  the  only  distribution  available  to  explain  the  given  data  set.  From  the  abundance  data 
we  only  have  the  information  on  the  upper  portion  of  the  underlying  distribution.  Thus, 
we  can  only  provide  a  good  fit  for  this  portion  of  the  curve.  While  some  distributions  can 
be  rejected  with  authority  (i.e.  gamma  distributions),  the  data  could  be  fitted  equally  well 
with  the  power-law  distribution.  However,  this  has  not  yet  been  tested  with  the  present 
bacterioplankton  library.  The  data  of  the  bacterioplankton  sample  in  Figures  3C  and  3D 
are  plotted  together  with  the  power  trend  line  as  well  as  a  sample  taken  from  a  simulated 
lognormal  community.  Indeed,  from  the  observations  of  the  data  alone,  one  cannot 
conclude  which  distribution  (power-law  or  lognormal)  would  produce  a  better  fit  (Figs. 
3C  and  3D).  However,  the  power-law  model  is  likely  to  produce  a  similar  order  of 
magnitude  estimate  as  the  lognormal  model. 

Independent  of  the  statistical  approaches  used  to  estimate  diversity,  an  important 
question  remains:  to  what  extent  can  the  diversity  observed  for  a  given  clone  library  be 
extrapolated  to  a  sampled  microbial  community?  The  construction  of  the  libraries 
themselves  and  PCR  amplification  may  introduce  artifacts  and  biases,  which  may  alter 
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the  estimated  diversity  and  abundance  distribution  (Suzuki  and  Giovannoni,  1996;  Polz 
and  Cavanaugh,  1998;  Speksnijder  et  al.,  2001;  Thompson  et  al.,  2002).  The 
bacterioplankton  data  set  was  corrected  for  PCR  artifacts  such  as  Taq  errors,  chimeras 
and  heteroduplexes  and  it  can  be  reasonably  assumed  that  no  additional  taxa  were  added 
to  the  libraries.  However,  it  is  likely  that  universal  primers  used  in  the  PCR  failed  to 
amplify  some  organisms  from  the  sampled  community.  It  has  been  shown  that  a  number 
of  16S  rRNA  primers  previously  thought  to  be  universally  conserved  are  in  fact  not 
conserved  (Vergin  et  al.,  1998;  Daims  et  al.,  1999).  Therefore,  the  estimates  for  the 
library  should  be  regarded  as  minimum  diversity  estimates  for  the  sampled  environment. 

In  addition,  the  bias  introduced  due  to  preferential  amplification  of  some 
templates  (Suzuki  and  Giovannoni,  1996)  or  due  to  relative  abundances  of  multiple 
rRNA  operons  remains  an  open  question.  These  biases  can  potentially  distort  the  inferred 
distribution  of  abundances  of  taxa  in  the  sampled  community.  However,  the  extent  to 
which  they  contribute  to  the  standard  deviation  of  the  taxon  log-abundances,  o,  observed 
for  the  clone  library  cannot  be  entirely  resolved  at  this  point.  We  do,  however,  know  that 
the  ML  estimator  is  essentially  unbiased  for  both  total  number  of  taxa,  N,  and  a,  and  that 
the  value  of  a  is  large  for  our  library.  This  suggests  that  (i)  the  experimental  bias  would 
have  to  be  very  large  for  the  distribution  of  ribotypes  in  the  library  to  be  different  from 
the  underlying  distribution  of  the  community  and  (ii)  reducing  this  bias  would  not  change 
significantly  N,  but  only  result  in  the  more  accurate  estimates  of  N. 

The  large  diversity  of  -25,000  ribotypes  suggested  to  co-exist  in  the  coastal 
bacterioplankton  sample  raises  several  questions.  Firstly,  how  many  individual  genomes 
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underlie  the  ribotype  diversity?  Secondly,  to  what  extent  can  all  these  types  actually  be 
ecologically  differentiated?  We  have  previously  suggested,  based  on  extensive  analysis 
of  all  available  completely  sequenced  genomes,  that  on  average  the  number  of  ribotypes 
exceeds  the  number  of  genomes  by  ~2.5  fold  (Acinas  et  al.,  2004a).  Thus,  correcting  the 
value  of  25,000  ribotypes  for  the  contribution  of  multiple  operons,  a  value  of  10,000 
genomes  is  obtained.  If  these  were  ecologically  differentiated,  the  functional  diversity 
within  our  sample  would  indeed  be  striking.  We  previously  determined  that  the  number 
of  individuals  from  which  our  clone  library  was  constructed  was  ~106  bacterial  cells. 
Therefore,  if  the  genomes  were  uniformly  distributed  in  the  sample  each  would  only  be 
represented  by  ~  100  individuals,  but  under  the  lognormal  distribution,  which  is  suggested 
for  this  sample,  most  genomes  would  be  present  at  a  very  small  fraction.  However,  it  is 
highly  unlikely  that  individual  ribotypes  represent  functional  units,  because  functional 
units  would  be  then  composed  of  only  a  few  individuals.  It  appears  more  likely  that  the 
functional  units  are  formed  of  microdiverse  clusters  of  ribotypes,  implying  that  the 
functional  diversity  may  be  far  lower  than  the  observed  ribotype  diversity. 

In  fact,  we  have  previously  suggested  that  individual  genomes  may  not  actually 
represent  distinct  functional  units  within  the  community  but  that  such  units  are 
represented  as  microdiverse  ribotype  clusters  (Chapter  2  and  3).  These  clusters  may  arise 
by  selective  sweeps  and  persist  because  competitive  mechanisms  are  too  weak  to  purge 
diversity  from  within  them  (Acinas  et  al.,  2004b).  Indeed,  the  large  diversity  estimate  of 
ribotypes  supports  these  previous  suggestions.  In  addition,  we  observed  that  the  Poisson- 
lognormal  MLE  value  for  the  97%  similarity  groups  data  set  is  ~ 1,500  taxa  suggesting 
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that  the  vast  majority  of  the  ribotype  diversity  (85%)  is  contained  within  3%  sequence 
divergence  and  therefore  very  likely  organized  into  microdiverse  clusters. 

In  summary,  this  study  reveals  previously  unsuspected  diversity  within  a  complex 
marine  bacterial  community.  This  is  evident  from  development  and  application  of 
parametric  methods  based  on  maximum  likelihood.  Using  these  methods,  we  are  able, 
for  the  first  time,  to  constrain  the  value  of  diversity  estimates  and  critically  evaluate  the 
commonly  applied  Chaol  non-parametric  estimator.  We  are  confident  that  the  Chaol 
estimator  significantly  underestimates  diversity  of  complex  microbial  communities. 

Most  importantly  we  show  that  the  diversity  of  complex  microbial  communities  could  be 
much  greater  than  previously  thought. 
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Table  1.  Ribotype  abundances  observed  in  the  bacterioplankton  sample. 


Abundance  of 
ribotypes 

Number  of 
ribotypes 

1 

381 

2 

65 

3 

23 

4 

18 

5 

4 

6 

6 

7 

3 

9 

1 

11 

4 

13 

3 

14 

2 

16 

1 

21 

1 

27 

1 

32 

1 

43 

1 

45 

1 
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Figure  1.  Likelihood  surface  for  the  compound  Poisson-Lognormal  model.  The 
variables  on  the  axes  are  the  total  number  of  taxa  N  and  the  standard  deviation  of  the 
lognormal  distribution  o.  In  all  four  parts  of  the  figure,  black  contours  represent  estimates 
of  the  parameter  pair  values,  N  and  a,  1-5  standard  deviations  away  from  the  mean  (the 
most  likely  estimate  for  the  total  number  of  types).  A.  Likelihood  surface  of  the 
modified  Poisson-lognormal  model;  generated  as  a  product  of  the  two  likelihoods  shown 
in  B  and  C.  B.  Likelihood  surface  for  a  likelihood  function  obtained  from  the 
information  on  sample  abundances  in  the  sample  5  (equivalent  to  the  Bulmer’s  likelihood 
function  (Bulmer,  1974)).  C.  Likelihood  surface  for  a  likelihood  function  obtained  from 
the  information  on  the  number  of  types  observed  in  the  sample  s.  D.  Likelihood  surface 
for  the  compound  Poisson-Lognormal  model  and  the  most  likely  estimate  of  the  total 
number  of  types  in  a  bacterioplankton  community  sample  (black  cross);  averages  of  10 
samples  from  each  of  the  10  simulated  communities  estimated  for  the  Poisson-lognormal 
model  using  the  modified  likelihood  function  (red  solid  circles);  averages  of  10  samples 
from  each  of  the  10  simulated  communities  estimated  for  the  Poisson-lognormal  model 
using  the  Bulmer’s  likelihood  function  (blue  open  circles);  averages  of  the  10  Chaol 
values  of  the  10  samples  from  the  each  of  the  10  simulated  communities  (green  solid 
diamonds). 

Figure  2.  Likelihood  surface  for  the  Poisson-gamma  model  showing  relative  support  for 
each  pair  of  parameter  estimates  of  the  number  of  types  N  and  ln(o).  Black  contours  (1  - 


82 


5)  represent  estimates  of  the  parameter  pair  values,  N  and  a,  1-5  standard  deviations 
away  from  the  mean  (the  most  likely  estimate  for  the  total  number  of  types). 

Figure  3.  Lognormal  and  gamma  distributions  under  the  Poisson  sampling  fitted  to  data 
by  the  method  of  maximum  likelihood  and  compared  to  the  rarefaction  curve  of  the 
actual  data.  A.  Rarefaction  curve  of  the  bacterioplankton  sample  -  taxa  constructed  from 
100%  sequence  similarity  clustering  (blue  solid  diamonds);  Poisson-lognormal 
distribution  (red  solid  circles);  Chaol  estimate  under  the  Poisson-Lognormal  (orange 
open  circles);  Curtis  estimate  under  the  Poisson-Lognormal  (green  open  circles);  Poisson- 
Gamma  distribution  (black  star);  and  MacArthur’s  broken  stick  distribution  (black  cross). 
B.  Rarefaction  curve  of  the  bacterioplantkon  sample  -  taxa  constructed  from  97% 
sequence  similarity  clustering  (blue  solid  diamonds);  Poisson-lognormal  distribution  (red 
solid  circles);  Chaol  estimate  under  the  Poisson-Lognormal  (orange  open  circles);  and 
Poisson-Gamma  distribution  (black  star).  C.  Probability  exceedance  plot  of  the 
bacterioplantkon  sample  -  100%  sequence  similarity  clustering  OTUs  (blue  solid 
diamonds)  and  a  sample  taken  from  a  simulated  lognormal  community  (black  open 
triangles);  and  power  trendline  (black  solid  line).  D.  Probability  exceedance  plot  of  the 
bacterioplantkon  sample  -  97%  sequence  similarity  clustering  OTUs  (blue  solid 
diamonds)  and  a  sample  taken  from  a  simulated  lognormal  community  (black  open 
triangles);  and  power  trendline  (black  solid  line). 


83 


Figure  4.  Chaol  estimates  of  the  total  number  of  different  types  calculated  from  a 
simulated  community  of  25,000  different  types  and  o  of  2.7  using  “bias-corrected”  (green 
open  squares)  and  the  approximate  “uncorrected”  formula  (blue  solid  diamonds)  as  a 
function  of  sample  size.  Error  bars  are  one  standard  deviation  and  were  calculated  with 
the  variance  formula  derived  by  Chao  (1987). 
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Estimated  population  size  average  from  10  simulated  communities  calculated  from 
Poisson-lognormal  model  using  maximum  likelihood  is  shown  in  red  solid  circles.  The 
red  lines  are  error  bars  showing  one  standard  deviation  for  the  Poisson-lognormal  model. 
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Chapter  Five 

Summary  and  Future  Directions 


SUMMARY  AND  FUTURE  DIRECTIONS 


This  thesis  explores  questions  of  microbial  community  structure  in  the  marine 
environment.  Specifically,  two  fundamental  questions  are  investigated:  (i)  how  many 
bacterial  types  co-exist,  and  (ii)  does  a  phylogenetic  structure  exist  that  would  suggest 
units  of  differentiation  among  natural  microbial  communities?  Bacterial  diversity  in  two 
complex  marine  communities  (coastal  bacterioplankton  and  sediment  sulfate-reducing 
bacteria)  was  investigated  by  (i)  comprehensive  analysis  of  large  16S  rRNA  clone 
libraries,  and  (ii)  refinement  and  application  of  parametric  diversity  estimators. 

Several  major  results  are  presented: 

•  development  of  analysis  protocols  and  tools  to  constrain  artifacts  that  may  lead  to 
overestimation  of  diversity  and,  more  significantly,  obscure  patterns  of  community 
organization  (Chapters  2  and  3). 

•  refinement  and  application  of  statistical  methods  to  estimate  microbial  diversity 
(Chapter  4). 

•  demonstration  of  co-existence  of  previously  undetected  high  diversity  within  marine 
bacterial  communities  (Chapters  2,  3,  and  4). 

•  identification  of  patterns  of  community  organization  revealing  predominance  of 
microdiverse  ribotype  clusters  that  are  hypothesized  to  correspond  to  ecologically 
distinct  units  (Chapters  2  and  3). 


Chapter  2  explores  the  extent  of  diversity  and  phylogenetic  structure  of  a  bacterial 
sediment  community.  The  clone  library  investigated  in  this  chapter  was  constructed  from 
16S  rRNA  genes  of  delta-proteobacterial  sulfate-reducing  bacteria  (SRB)  from  salt-marsh 
sediment  samples.  We  observed  unexpected  high  diversity  of  ribotypes  and 
predominance  of  microdi verse  relationships  among  the  co-existing  SRB.  This  high 
diversity  is  indeed  surprising  as  the  SRB  overall  possess  very  similar  metabolism; 
however,  as  detailed  in  Chapter  3  the  actual  functional  units  within  this  community  may 
not  be  individual  ribotypes  but  microdiverse  clusters  of  ribotypes,  so  that  the  functional 
diversity  within  the  sediment  SRB  may  be  far  lower  than  the  observed  ribotype  diversity. 
It  may  be  possible  in  the  future  to  critically  test  hypotheses  on  structure  and  function 
using  this  community  as  a  model  system.  First,  the  SRB  community  was  well  sampled  so 
that  it  is  likely  that  all  major  groups  were  detected.  This  good  coverage  may  serve  as  a 
foundation  to  develop  tools  for  monitoring  specific  SRB  populations  in  order  to  correlate 
their  growth  with  prevalence  of  specific  conditions  within  the  sediment.  For  example,  this 
may  be  achieved  by  application  of  DNA  microarrays  specifically  designed  to 
differentiate  individual  ribotypes  and  ribotype  clusters  within  the  community.  Second, 
SRB  communities  are  overall  well  studied  and  therefore  an  extensive  dataset  exists 
whose  comparison  with  the  current  study  may  reveal  patterns  in  occurrence  of  specific 
SRB  types  in  different  environments.  For  example,  in  the  marsh  sediment  previously 
unidentified  completely  oxidizing  SRB  dominated  but  other  studies  have  also  found 
predominance  of  the  same  group  of  SRB  in  similar  environments. 


Diversity  and  phylogenetic  structure  of  a  coastal  bacterioplankton  community  was 
investigated  in  Chapter  3.  This  community  was  chosen  because  it  differs  in  its  overall 
ecological  features  from  the  sediment  community  and  would  therefore  be  a  test  case  for 
the  general  applicability  of  the  findings  in  Chapter  2.  The  pelagic  environment  is  well 
mixed,  while  sediments  are  highly  structured  environments.  Therefore,  one  may  expect 
that  the  underlying  community  composition  of  the  two  communities  would  be  different. 
For  example,  it  may  be  hypothesized  that  the  efficient  mixing  in  the  pelagic  environment 
may  allow  for  more  efficient  selective  sweeps  within  the  community.  These  would  serve 
to  purge  diversity  leading  to  a  more  simple  overall  community  composition.  However, 
similar  fine-scale  phylogenetic  structures  were  observed  for  the  coastal  community  as 
well.  Although  microdiversity  has  been  previously  suggested  by  analysis  of  specific 
microbial  groups  in  PCR-generated  clone  libraries  (Field  et  al.,  1997;  Garcia-Martinez 
and  Rodriguez-Valera,  2000;  Casamayor  et  al.,  2002),  it  had  remained  unclear  to  what 
extent  microdiversity  arises  by  PCR  induced  artifacts  or  is  the  result  of  paralogous  rRNA 
operons  within  the  same  genome  (Suzuki  and  Giovannoni,  1996;  Polz  and  Cavanaugh, 
1998;  Speksnijder  et  al.,  2001;  Thompson  et  al.,  2002;  Rappe  and  Giovannoni,  2003). 

Application  of  PCR-based  approaches  on  the  16S  rRNA  gene  can  potentially 
affect  the  estimation  of  diversity  in  several  ways;  by  (1)  formation  of  sequence  artifacts 
(Taq  errors,  chimeras,  and  heteroduplexes),  (2)  preferential  amplification  of  some 
templates  over  others,  which  can  skew  sequence  abundances,  (3)  missing  some  of  the 
sequence  diversity  due  to  the  primer  selection,  and  (4)  the  incidence  of  multiple  rRNA 
operons  within  a  single  genome. 


(1)  We  have  developed  methods  that  minimize  and  account  for  chimeras,  Taq 
errors  and  heteroduplex  errors.  It  has  been  shown  empirically  that  the  methods  we 
employed  removed  all  error  due  to  heteroduplex  formation  (Thompson  et  al.,  2002).  The 
remaining  types  of  error  cannot  be  detected  directly  and  must  be  inferred.  We  have 
employed  the  two  most  widely  used  chimera-checking  programs  and  have  written  our 
own  software  specifically  designed  for  clone  libraries  that  have  been  sampled  to  a  high 
degree  (Chapter  2  and  3).  Finally,  we  have  developed  methods  to  identify  and  eliminate 
polymerase  errors  based  on  well-known  patterns  of  primary  and  secondary  structure 
conservation  and  have  provided  several  independent  estimates  of  their  effectiveness 
(Chapter  2  and  3).  It  is  important  to  note  that  one  important  component  of  this  thesis  is 
the  development,  for  the  first  time  of  a  means  to  estimate  the  number  of  rRNA  sequences 
affected  by  TAQ  errors  and  some  guidelines  on  how  to  identify  them. 

(2  &  4)  At  this  point  we  cannot  ascertain  the  extent  of  bias  resulting  from  the 
preferential  amplification  of  templates  (Suzuki  and  Giovannoni,  1996)  or  from  different 
abundances  of  multiple  rRNA  operons.  However,  preliminary  evidence  suggests  that  this 
bias  may  not  be  large.  We  have  conducted  and  published  a  detailed  investigation  of 
operon  heterogeneity  among  published  genomes  (Acinas  et  al.,  2004).  About  40%  of 
genomes  have  1  or  2  operons  and  a  majority  of  their  sequences  are  identical,  indicating 
that  the  number  of  operons  per  genome  is  highly  skewed  towards  the  lower  spectrum. 
Also,  an  extensive  survey  of  97  published  bacterial  genomes  showed  that  a  correction 
factor  of  0.4  can  be  applied  to  estimate  diversity  of  genomes  from  unique  rRNA 
sequences  (Acinas  et  al.,  2004).  Thus,  after  applying  this  correction,  the  bias  stemming 


from  the  preferential  amplification  of  some  templates  and  different  number  of  identical 
operons  may  only  distort  the  distribution  of  ribotype  abundances  of  the  sampled 
community,  but  should  not  change  their  total  number. 

(3)  PCR  using  universal  primers  may  fail  to  amplify  some  organisms  from  a 
sampled  community.  It  has  been  shown  that  several  16S  rRNA  primers  previously 
thought  to  be  universally  conserved  are  in  fact  not  (Vergin  et  al.,  1998;  Daims  et  al., 
1999).  Therefore,  as  we  cannot  exclude  the  possibility  that  the  primers  may  fail  to 
amplify  some  members  of  the  domain  Bacteria,  the  estimates  of  the  diversity  should  be 
regarded  as  a  minimum  diversity  for  a  sampled  community.  This  implies  that  the 
diversity  of  the  sampled  community  may  be  even  higher  than  diversity  reported  for  the 
library.  It  is  important  to  note  that  prior  to  the  amplification  of  the  bacterioplankton 
community,  we  carefully  evaluated  the  existing  universal  primers  27F  and  1492R  and 
modified  these  to  include  members  from  the  order  Planctomycetales  based  on  the 
information  provided  by  Vergin  et  al.  (1998).  The  mismatch  of  universal  primers  to  the 
16S  rRNA  sequences  was  inferred  from  a  survey  of  Planctomycetales  clones  recovered 
from  a  marine  fosmid  library  (Vergin  et  al.,  1998).  Although  it  is  possible  that  these 
modified  primers  may  have  a  mismatch  with  some  unknown  bacterial  16S  rRNA 
sequences,  a  marine  molecular  survey  conducted  without  prior  PCR  amplification 
indicated  that  no  novel  clades  of  the  domain  Bacteria  could  be  observed  (Venter  and  al., 
2004).  Thus,  it  is  very  likely  that  we  detected  at  least  the  dominant  members  of  the 
bacterioplankton  community. 


Overall,  by  (i)  developing  methods  that  minimize  and  account  for  the  contribution 
of  sequence  artifacts,  (ii)  accounting  for  variation  in  multiple  operons  within  single 
genomes,  and  (iii)  improving  the  existing  primers,  we  were  able  to  reduce  the  difference 
of  the  total  number  of  ribotypes  between  the  sampled  communities  and  the  constructed 
large  clone  libraries. 

The  large  size  of  the  clone  libraries  enabled  comparison  of  statistical  approaches 
used  in  estimating  microbial  diversity,  as  well  as  development  and  application  of 
parametric  methods  based  on  maximum  likelihood  (Chapter  4).  With  our  dataset  and 
simulations  we  were  able  to  critically  evaluate  the  estimates  obtained  by  the  commonly 
applied  Chaol  non-parametric  estimator  and  the  bias  associated  with  this  estimator.  In 
addition,  we  evaluated  existing  parametric  methods  as  they  should  perform  better  for 
diverse  microbial  communities,  and  modified  them  to  better  account  for  the  specific  way 
microbial  communities  are  sampled.  The  diversity  estimated  using  the  parametric 
approach  revealed  an  even  higher  number  of  co-existing  microdiverse  sequences,  and  the 
estimated  diversity  was  over  one  order  of  magnitude  higher  than  that  suggested  by 
common  non-parametric  approaches.  However,  when  sequences  were  clustered  into  97% 
sequence  similarity  groups,  diversity  estimates  increased  by  a  much  smaller  factor 
compared  to  the  Chaol.  This  suggests  that  the  overall  pattern  of  predominance  of 
microdiverse  clusters  is  strongly  confirmed  by  the  new  analysis  tools. 

Overall,  the  compensation  for  artifacts  and  improved  estimation  revealed  that  the 
vast  majority  of  ribotypes  fall  into  microdiverse  clusters  containing  <1%  sequence 
divergence.  Whether  the  observed  ribotype  clusters  represent  ecotypes,  i.e.  ecologically 
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cohesive  populations,  will  have  to  be  determined  by  detailed  examination  of  the 
environmental  dynamics  of  genomic  variants.  It  is  proposed  that  the  observed 
microdiverse  clusters  form  important  units  of  differentiation  in  microbial  communities. 
They  are  hypothesized  to  arise  by  selective  sweeps  and  contain  high  diversity  because 
competitive  mechanisms  are  too  weak  to  purge  diversity  from  within  them. 
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GLOSSARY 


Bacterioplankton  -  all  bacteria  that  passively  drift  in  lakes  and  ocean 

Chimera  -  a  sequence  formed  from  two  different  sequences. 

Clone  -  a  lineage  of  individuals  produced  asexually. 

Cluster  -  a  set  of  sequences  grouped  by  some  sequence  similarity  cutoff. 

Diversity  -  the  heterogeneity  of  a  system;  the  variety  of  different  types  of  organisms 
occurring  together  in  a  biological  community. 

Ecological  niche  -  the  functional  role  of  an  organism  within  an  ecosystem;  the  combined 
description  of  the  physical  habitat,  functional  role,  and  interactions  of  the 
microorganisms. 

Guild  -  populations  within  a  community,  which  use  the  same  resources. 
m  -  sample  abundances  (Chapter  4). 

M  -  population  abundances  (Chapter  4). 
n  -  a  number  of  taxa  in  a  sample  (Chapter  4). 

N  -  the  total  number  of  taxa  in  a  population  (Chapter  4). 

Non-parametric  diversity  estimators  -  diversity  estimators  that  assume  no  models  of 
distribution  of  taxon  abundances. 

OTU  -  operational  taxonomic  unit. 

Parametric  diversity  estimators  -  diversity  estimators  that  assume  abundance 
distribution  of  taxa. 

Phylogeny  -  the  line  or  lines,  of  direct  descent  in  a  given  group  of  organisms;  also  the 
study  or  the  history  of  such  relationships. 

Population  -  the  set  of  data  from  which  a  statistical  sample  is  taken  (Chapter  4>. 
Population  abundance  -  taxon  abundance  in  a  population  (Chapter  4). 
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R  -  relative  abundance  of  taxon  (Chapter  4). 

Ribotype  -  unique  rRNA  sequence. 

a-  standard  deviation  of  the  taxa  log-abundances  (Chapter  4). 
s  -  sample  size  (Chapter  4). 

Sample  -  a  set  of  sequences  from  a  clone  library  used  in  diversity  estimates  (Chapter  4). 
Sample  abundance  -  taxon  abundance  in  a  sample. 

Taxon  -  a  taxonomic  category  or  a  group. 
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